REPARAMETERIZATION THROUGH SPATIAL GRADIENT SCALING

Abstract

Reparameterization aims to improve the generalization of deep neural networks by transforming convolutional layers into equivalent multi-branched structures during training. However, there exists a gap in understanding how reparameterization may change and benefit the learning process of neural networks. In this paper, we present a novel spatial gradient scaling method to redistribute learning focus among weights in convolutional networks. We prove that spatial gradient scaling achieves the same learning dynamics as a branched reparameterization yet without introducing structural changes into the network. We further propose an analytical approach that dynamically learns scalings for each convolutional layer based on the spatial characteristics of its input feature map gauged by mutual information. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that without searching for reparameterized structures, our proposed scaling method outperforms the state-of-the-art reparameterization strategies at a lower computational cost.

1. INTRODUCTION

The ever-increasing performance of deep learning is largely attributed to progress made in neural architectural design, with a trend of not only building deeper networks (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014) but also introducing complex blocks through multi-branched structures (Szegedy et al., 2015; 2016; 2017). Recently, efforts have been devoted to Neural Architecture Search, Network Morphism, and Reparameterization, which aim to strike a balance between network expressiveness, performance, and computational cost. Neural Architecture Search (NAS) (Elsken et al., 2018; Zoph & Le, 2017) searches for network topologies in a predefined search space, which often involves multi-branched micro-structures. Examples include the DARTS (Liu et al., 2019) and NAS-Bench-101 (Ying et al., 2019) search spaces, which span a large number of cell (block) topologies that are stacked together to form a neural network. In Network Morphism (Wei et al., 2016; 2017), a well-trained parent network is morphed into a child network with the goal of adopting it on a downstream application with minimum re-training. Morphism preserves the parent network's functions and output while yielding child networks that are deeper and wider. Structural reparameterization (Ding et al., 2021c) attempts to branch and augment certain operations during training into an equivalent but more complex structure with extra learnable parameters. For example, the Asymmetric Convolution Block (ACB) (Ding et al., 2019) augments a regular 3 × 3 convolution with both horizontal 1 × 3 and vertical 3 × 1 convolutions, such that training is performed on the reparameterized network, which takes advantage of the changed learning dynamics. During inference, the trained reparameterized network is equivalently transformed back to its simple base structure, preserving the original low inference time while maintaining the boosted performance of the reparameterized model.
However, there exists a gap regarding the understanding of how and when the different learning dynamics of a reparameterized model could help its training. In addition, the search for the optimal reparameterized structure over a discrete space (Huang et al., 2022) inevitably increases the computational cost in deep learning. In this paper, we propose Spatial Gradient Scaling (SGS), an approach that changes learning dynamics as reparameterization does, yet without introducing structural changes to the neural network. We examine the question: can we achieve the same effect as branched reparameterization on convolutional networks without changing the network structure? Our proposed spatial gradient scaling learns a spatially varied scaling for the gradient of convolutional weights, which we prove to have the same learning dynamics as branched reparameterization, without modifying network structure. We further show that scaling gradients by examining the spatial dependence of neighboring pixels in the input (or intermediate) feature maps can boost neural network learning performance, without searching for the reparameterized form.

Our main contributions can be summarized as follows:

• We investigate a new problem of spatially scaling gradients in the kernels of convolutional neural networks. This method enhances the performance of existing architectures only by redistributing learning rates spatially, i.e., by adaptively strengthening or weakening the gradients of convolution weights according to their relative position in the kernel.

• We mathematically establish a connection between the proposed SGS and parallel convolutional reparameterization, and show their equivalence in learning. This enables an understanding of how the existing multi-branched reparameterization structures help improve feature learning. This interpretation also suggests an architecture-independent reparameterization method that directly induces the effect via scaled gradients, bypassing the need for complex structure augmentations and saving on computational costs.

• We propose a lightweight method to compute gradient scalings for a given network and dataset based on the spatial dependencies in the feature maps. Specifically, we make novel use of the inherent mutual information between neighboring pixels of a feature map within the receptive field of a convolution kernel to dynamically determine gradient scalings for each network layer, with only a minimum overhead to the original training routine. Extensive experiments show that the proposed data-driven spatial gradient scaling approach leads to results that compete with or outperform state-of-the-art methods on several image classification models and datasets, yet with almost none of the extra computational cost and memory consumption during training that multi-branched reparameterization structures require.

2. RELATED WORK

2.1. MULTI-BRANCH STRUCTURES

VGG (Simonyan & Zisserman, 2014) is a base model for several computer vision tasks. Due to its limitations, several new structures have been proposed with multiple branches to achieve higher performance. GoogleNet (Szegedy et al., 2015) and Inception (Szegedy et al., 2015; 2016; 2017) architectures deploy multi-branch structures to enrich the learned feature space. ResNet (He et al., 2016) uses a simplified two-branch structure that adds the input of a layer to its output through residual connections. The improvements in top-1 accuracy of ImageNet classification using these structures demonstrate the importance of multiple receptive fields (e.g., 1 × 1, 1 × K, K × 1, and K × K convolutions), diverse connections of layers, and combinations of parallel branches. These performance improvements often come at a computational cost, as complex model topologies are less hardware friendly and have increased computational requirements. Outside of expertly designed networks, advancements in Neural Architecture Search (NAS) allow for the automation of network design. Several search spaces and discovered high-performing networks (Ying et al., 2019; Dong & Yang, 2020; Ding et al., 2021a) utilize multi-branch structures, which shows their ubiquity in modern convolutional architectures. Due to the enormous number of possible branched model topologies, search is often computationally expensive and requires vast computational resources.

2.2. STRUCTURAL REPARAMETRIZATION

Multi-branch structures enhance the performance of ConvNets. This comes at the cost of higher memory and computational power requirements, which is undesirable for inference-time applications. Structural reparameterization solves this by training with a complex multi-branch model to improve the learned representations, then equivalently transforming it back to the original simple base model during inference for decreased computational costs. RepVGG (Ding et al., 2021c) introduces a family of VGG-like inference models that are trained with ResNet-inspired reparameterization blocks. These reparameterization blocks consist of parallel 3 × 3 and 1 × 1 convolutions along with an identity branch. After training, using the linearity of convolutions, parallel branches are equivalently transformed into a single 3 × 3 convolution. The inference VGG-like model has the advantage of both the enhanced learned representations of the complex reparameterized model and the fast and efficient inference of the simple base model. Similarly, DBB (Ding et al., 2021b) structurally reparameterizes models by replacing every K × K convolution with a multi-branch topology composed of multi-scale and sequential 1 × 1 - K × K convolutions and average pooling during training. After training, DBB blocks are equivalently reparameterized back to K × K convolutions for efficient inference. Instead of structurally reparameterizing all convolutions, DyRep (Huang et al., 2022) aims to selectively reparameterize only important convolutions to improve the training efficiency of the reparameterized model. Our spatial gradient scaling approach has the benefit of a branched reparameterization without the added training cost of an augmented network structure.

ACNet (Ding et al., 2019), through pruning experiments, showed that convolution weights on the central crisscross positions of the 3 × 3 kernels are more important to the model's representational capacity than corner weights. To further enhance the importance of the kernel crisscross, they reparameterize 3 × 3 convolutions during training with their Asymmetric Convolution Blocks (ACB). An ACB comprises parallel 3 × 3, 1 × 3, and 3 × 1 convolutions. They found that models trained with ACB reparameterization perform better than the base model. Like ACNet, we also emphasize the importance of kernel central positions. However, instead of using ACB blocks, which add significant training cost, we scale the gradients of convolution weights with a spatially varying gradient scaling. In fact, by using spatial gradient scaling, we can emulate the presence of a multi-branch topology without adding to the structure and computational cost of the training model.
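The merging step that structural reparameterization relies on can be illustrated with a small numerical check. The sketch below is only an illustration under simplifying assumptions: a single input and output channel, stride 1, zero padding, and no batch normalization (real RepVGG blocks also fold batchnorm statistics into the kernels). The helper names `correlate2d_same` and `embed_center` are hypothetical and do not come from any of the cited code bases.

```python
import numpy as np

def correlate2d_same(x, k):
    """Cross-correlate a 2-D map with an odd-sized kernel using zero padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def embed_center(k, size=3):
    """Zero-pad a smaller odd-sized kernel into the center of a size x size kernel."""
    full = np.zeros((size, size))
    oh = (size - k.shape[0]) // 2
    ow = (size - k.shape[1]) // 2
    full[oh:oh + k.shape[0], ow:ow + k.shape[1]] = k
    return full

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
k3 = rng.normal(size=(3, 3))   # 3 x 3 branch
k1 = rng.normal(size=(1, 1))   # 1 x 1 branch
identity = np.ones((1, 1))     # identity branch written as a 1 x 1 kernel of value 1

# Training-time output: sum of the three parallel branches.
y_branches = correlate2d_same(x, k3) + correlate2d_same(x, k1) + x

# Inference-time kernel: all branches folded into a single 3 x 3 kernel.
k_merged = k3 + embed_center(k1) + embed_center(identity)
y_merged = correlate2d_same(x, k_merged)

branches_match = np.allclose(y_branches, y_merged)
```

Because convolution is linear in its kernel, summing the zero-padded branch kernels gives the same output as summing the branch outputs, which is why the merged model keeps the base model's inference cost.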

2.3. FEATURE SELECTION WITH MUTUAL INFORMATION

Mutual information has extensive applications in the domain of computer vision and medical imaging (Pluim et al., 2003). Viola & Wells III (1997) use mutual information as a metric for comparing the alignment of a 3D model to video images. Russakoff et al. (2004) use mutual information to measure similarity between images. In deep learning, Cheng et al. (2018) used the mutual information between inputs, outputs, and target labels of a neural network to infer its power of distinguishing between classes. In this paper, we use mutual information in a novel way to capture dependencies between neighboring elements within a feature map. We use this spatial information as a dynamic scale for adjusting the importance of spatial positions in a convolution kernel.

3. METHOD

In this section, we first introduce spatial gradient scaling for convolutional neural networks. We then establish its connection to reparameterization and mathematically show their equivalence. Finally, we describe our mutual-information-based approach to dynamically determine spatial gradient scaling during the training process at a low computational cost.

3.1. SPATIAL GRADIENT SCALING

Gradient scaling adjusts backpropagation by strengthening or diminishing gradients of learnable parameters based on the significance of their position in the convolutional kernel. Let $W_l^{(t)}$ be the learnable weights for the l-th convolutional layer at training iteration t. The shape of $W_l^{(t)}$ is denoted by $(c_l^{out}, c_l^{in}, k_l^x, k_l^y)$, where $c_l^{out}$ and $c_l^{in}$ are sizes of output and input channels, respectively, and $(k_l^x, k_l^y)$ denotes the kernel size. During each iteration of training, we backpropagate the training loss $L_{train}$ to calculate the gradients of the learnable parameters $\partial L / \partial W_l^{(t)}$. Following matrix calculus notations, let $\partial L / \partial W$ denote a derivative tensor whose element at the position indexed by (m, n, o, p) is given by $\partial L / \partial W_{m,n,o,p}$. Then, a gradient descent optimization step for the convolutional layer l with weights $W_l$ can be represented by

$$W_l^{(t+1)} \Leftarrow W_l^{(t)} - \lambda^{(t)} f\left(\frac{\partial L}{\partial W_l^{(t)}}, W_l^{(t)}, \ldots, \frac{\partial L}{\partial W_l^{(0)}}, W_l^{(0)}\right), \tag{1}$$

where $\lambda^{(t)}$ is the learning rate and $f(\cdot)$ is an optimizer-dependent element-wise function of the current and past gradients and weights. Our approach scales the gradient by a spatial gradient scaling matrix $G_l^{(t)}$, to yield the following gradient descent update rule:

$$W_l^{(t+1)} \Leftarrow W_l^{(t)} - \lambda^{(t)} f\left(G_l^{(t)} \odot \frac{\partial L}{\partial W_l^{(t)}}, W_l^{(t)}, \ldots, G_l^{(0)} \odot \frac{\partial L}{\partial W_l^{(0)}}, W_l^{(0)}\right), \tag{2}$$

where $G_l^{(t)}$ is a matrix of shape $(k_l^x, k_l^y)$ and $\odot$ denotes element-wise multiplication along dimensions $c_l^{out}$ and $c_l^{in}$. We additionally constrain $G_l^{(t)}$ to be strictly positive with a mean of 1 to prevent large changes in the overall gradient direction and magnitude. To account for various optimizers, we scale the gradient before any optimizer-dependent calculations like momentum and weight decay.
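As a concrete illustration of the scaled update, the following sketch applies a spatial scaling matrix to a weight gradient before a plain SGD step. It is a minimal example under stated assumptions: the optimizer function f is taken to be the identity on the current gradient (no momentum or weight decay), and the function names `normalize_scaling` and `sgs_sgd_step` as well as the example scaling values are hypothetical.

```python
import numpy as np

def normalize_scaling(g):
    """Return a strictly positive scaling matrix rescaled to have mean 1."""
    g = np.asarray(g, dtype=float)
    if np.any(g <= 0):
        raise ValueError("scaling entries must be strictly positive")
    return g / g.mean()

def sgs_sgd_step(weight, grad, scaling, lr):
    """One plain-SGD step with a spatially scaled gradient.

    weight, grad: arrays of shape (c_out, c_in, k_x, k_y)
    scaling:      array of shape (k_x, k_y)
    """
    # NumPy broadcasting repeats the (k_x, k_y) scaling over c_out and c_in.
    scaled_grad = grad * scaling
    return weight - lr * scaled_grad

rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 2, 3, 3))
grad = rng.normal(size=(4, 2, 3, 3))

# Example scaling that emphasizes the center and the cross of a 3 x 3 kernel.
scaling = normalize_scaling([[1, 2, 1],
                             [2, 3, 2],
                             [1, 2, 1]])
new_weight = sgs_sgd_step(weight, grad, scaling, lr=0.1)
```

With an optimizer that keeps state, the same multiplication would be applied to the gradient before it is handed to the optimizer, in line with the statement that scaling precedes momentum and weight decay.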

Elements of $G_l^{(t)}$ are learned in three steps. First, we define each element in $G_l^{(t)}$ by its displacement from the center element. Next, for each such displacement, we measure the average spatial relatedness between every pair of pixels in the input feature map that are separated by that displacement. We denote this as the average spatial dependence and define it mathematically as a function of the feature map and displacement in Section 3.3. Finally, we assign values to elements in $G_l^{(t)}$ based on the average spatial dependence of their displacement. The overall process is depicted in Figure 1. We assign higher learning priority to elements with higher average spatial dependence. Feature maps with a high spatial correlation over large displacements give rise to more uniform spatial gradient scalings. Feature maps with low spatial correlations yield center-concentrated scalings.
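The three steps can be sketched as follows. This is an illustrative example only: the exponential `toy_dependence` function is a hypothetical stand-in for the measured average spatial dependence (which Section 3.3 defines through mutual information), and the function names are not taken from the authors' code.

```python
import numpy as np

def displacement_grid(kx, ky):
    """Step 1: displacement (i, j) of every kernel element from the center."""
    di, dj = np.meshgrid(np.arange(kx) - kx // 2,
                         np.arange(ky) - ky // 2,
                         indexing="ij")
    return di, dj

def scaling_from_dependence(kx, ky, dependence):
    """Steps 2-3: look up the dependence of each displacement, then normalize to mean 1."""
    di, dj = displacement_grid(kx, ky)
    g = np.vectorize(dependence)(di, dj).astype(float)
    return g / g.mean()

def toy_dependence(i, j, decay):
    """Hypothetical dependence that decays with pixel distance (assumption)."""
    return float(np.exp(-decay * np.hypot(i, j)))

# Dependence that persists over large displacements -> nearly uniform scaling.
g_slow = scaling_from_dependence(3, 3, lambda i, j: toy_dependence(i, j, decay=0.1))
# Dependence that vanishes quickly -> center-concentrated scaling.
g_fast = scaling_from_dependence(3, 3, lambda i, j: toy_dependence(i, j, decay=2.0))
```

Comparing `g_slow` and `g_fast` shows the qualitative behavior described above: the slower the dependence decays with displacement, the flatter the resulting scaling.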

3.2. EQUIVALENCE TO REPARAMETERIZATION

We now establish the relationship between the proposed spatial gradient scaling and parallel convolution reparameterization. Specifically, we show that backward propagation for a single convolution is different from that of its N-branch reparameterization, where the latter is equivalent to updating the original convolution with a certain spatial gradient scaling. Consider a $k_l^x \times k_l^y$ convolutional layer l with trainable weights $W_l^{(t)} \in \mathbb{R}^{c_l^{out} \times c_l^{in} \times k_l^x \times k_l^y}$, input $X_l^{(t)}$ and output $Y_l^{(t)}$ at timestep t in training. Following Ding et al. (2019), we can reparameterize this single convolutional layer into a general N-branch convolutional structure as depicted in Figure 2 (b), with batchnorms after the reparameterization instead of within. Each branch n contains a convolution with a receptive field no larger than $(k_l^x, k_l^y)$. To mathematically represent variable-sized kernels in different branches, we let $M_{l,n} \in \{0, 1\}^{k_l^x \times k_l^y}$ denote a binary mask, which is a matrix in the shape of the corresponding kernel's receptive field, as illustrated in Figure 2 (c). Considering a forward pass, for each original convolutional layer l with weights $W_l^{(t)}$, we can find an equivalent N-branch reparameterization, where each branch n contains a convolutional weight $W_{l,n}^{(t)}$, such that

$$Y_l^{(t)} = W_l^{(t)} * X_l^{(t)} = \left(\sum_{n=1}^{N_l} M_{l,n} \odot W_{l,n}^{(t)}\right) * X_l^{(t)} = \sum_{n=1}^{N_l} \left(M_{l,n} \odot W_{l,n}^{(t)}\right) * X_l^{(t)}, \tag{3}$$

where $\odot$ denotes element-wise multiplication across each dimension $c_l^{out}$ and $c_l^{in}$. The second equality in Eq. 3 decomposes the weight tensor $W_l$, while the third equality achieves the branched structural form. The equivalence between a convolutional layer and its N-branch reparameterization with masked representation for each branch is illustrated in Figure 2 (c). However, in backward passes, $W_l$ updates differently in the reparameterized network than in its original form. That is, one step of gradient descent on the reparameterization in Eq. 3 has the following form to yield the merged weights $W_l^{(t+1)}$ for the next timestep:

$$W_l^{(t+1)} \Leftarrow W_l^{(t)} - \lambda^{(t)} \sum_{n=1}^{N_l} M_{l,n} \odot f\left(\frac{\partial L}{\partial W_{l,n}^{(t)}}, W_{l,n}^{(t)}, \ldots, \frac{\partial L}{\partial W_{l,n}^{(0)}}, W_{l,n}^{(0)}\right). \tag{4}$$

Note that the learning dynamics for the reparameterized network differ from the original convolution in Eq. 1, which explains why it may attain higher generalization despite having identical expressivity. Despite their different topologies, we show in the following lemma that updates for the original convolution (Eq. 1) and any of its N-branch reparameterizations (Eq. 4) differ only by a constant spatial gradient scaling $G_l$.

Lemma 1. Assume $f(\cdot)$ is a linear function in a gradient descent optimization algorithm (e.g., momentum, weight decay). For any reparameterization of a convolutional layer l that can be represented as a summation of N convolutional branches with weights $W_{l,n}^{(t)}$ and binary receptive field masks $M_{l,n}$, for n = 1, . . . , N, its gradient descent update (Equation 4) is equivalent to

$$W_l^{(t+1)} \Leftarrow W_l^{(t)} - \lambda^{(t)} f\left(G_l \odot \frac{\partial L}{\partial W_l^{(t)}}, W_l^{(t)}, \ldots, G_l \odot \frac{\partial L}{\partial W_l^{(0)}}, W_l^{(0)}\right), \tag{5}$$

where $G_l = \sum_{n=1}^{N} M_{l,n}$ is a spatial gradient scaling applied to the original convolution.

The proof in Appendix A follows readily from the equations of gradient descent and the linearity of convolutions. Lemma 1 has multiple implications. First, it provides an understanding of how branched reparameterization helps to change the backpropagation dynamics; it redistributes learning rates spatially to focus on more important weights in a convolutional kernel. Second, Lemma 1 allows us to convert the search for a reparameterization structure into an equivalent numerical gradient scaling search, which is more efficient and lends itself to analytical methods (as we demonstrate in Section 3.3). Our scaling interpretation of structural reparameterization allows us a more flexible search space unconstrained by the computational complexity of the underlying structure.
Third, we find agreement between our formalized gradient scaling approach and observational rules of thumb in the reparameterization literature. For example, the current trend of preferring branches with a diverse range of convolutional receptive fields presented by Ding et al. (2021b) may stem from the fact that without it, the gradient scaling is uniform and loses its spatial emphasis.
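Lemma 1 can be checked numerically in the simplest setting. The sketch below is an illustration under stated assumptions, not the authors' proof: it uses plain SGD (so f returns the current gradient), a single-channel 3 × 3 kernel, ACB-style branch masks, and random arrays as stand-ins for the loss gradient of the merged kernel. Since the merged kernel is a masked sum of branch kernels, the chain rule gives each branch the masked gradient, and that is the only property the check relies on. The helper names are hypothetical.

```python
import numpy as np

def centered_mask(rh, rw, kx=3, ky=3):
    """Binary mask of an rh x rw receptive field centered in a kx x ky kernel."""
    m = np.zeros((kx, ky))
    oh, ow = (kx - rh) // 2, (ky - rw) // 2
    m[oh:oh + rh, ow:ow + rw] = 1.0
    return m

# ACB-style branches: 3 x 3, 1 x 3 and 3 x 1 receptive fields.
masks = [centered_mask(3, 3), centered_mask(1, 3), centered_mask(3, 1)]

def merge(branch_weights):
    """Merged kernel W = sum_n M_n * W_n."""
    return sum(m * w for m, w in zip(masks, branch_weights))

rng = np.random.default_rng(0)
branch_w = [rng.normal(size=(3, 3)) for _ in masks]
w_single = merge(branch_w).copy()   # single convolution starts from the same mapping
G = sum(masks)                      # spatial gradient scaling from Lemma 1
lr = 0.1

for _ in range(5):
    # Stand-in for dL/dW of the merged kernel (any values work for this check).
    grad = rng.normal(size=(3, 3))
    # Branched model: dL/dW_n = M_n * dL/dW, followed by a plain SGD step per branch.
    branch_w = [w - lr * (m * grad) for m, w in zip(masks, branch_w)]
    # Single convolution: plain SGD step on the gradient scaled by G.
    w_single = w_single - lr * (G * grad)

updates_match = np.allclose(merge(branch_w), w_single)
```

For these masks, `G` equals [[1, 2, 1], [2, 3, 2], [1, 2, 1]]: the center is covered by all three branches, the cross positions by two, and the corners by one. Note that this is the unnormalized sum of masks from the lemma, not the mean-1 scaling used in Section 3.1.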

3.3. MUTUAL INFORMATION BASED SPATIAL GRADIENT SCALING

Figure 3: Overview of SGS calculation from the spatial dependencies in the input feature map. (a) A joint distribution is estimated for every pixel in the feature map and its (i, j) neighbor, denoted by random variables $P$ and $Q_{i,j}$ respectively. (b) Mutual information is calculated between $P$ and $Q_{i,j}$ for all (i, j), and values are placed in the spatial dependence matrix ($S_l$). (c) $S_l$ is transformed to SGS ($G_l$) by Equation 8. (d) $G_l$ is element-wise multiplied with the kernel gradients.

We describe our approach to finding spatial gradient scalings from the spatial dependencies in the input feature map. We begin by showing how mutual information can be used to quantify the average spatial dependency between (i, j)-displaced pixels. We then assign values to elements of our gradient scalings based on their spatial displacement to the center of the kernel and its associated average spatial dependency. Figure 3 shows an overview of the method.

We quantify the spatial dependence of two pixels through the use of probabilistic dependence. We define random variables $P_l$ and $Q_{l,i,j}$ as pixel values in the l-th layer feature map and their associated (i, j)-displaced neighbor values, respectively. The (i, j)-displaced neighbor for a pixel is the corresponding pixel (i, j) units away. We express spatial dependence, $S_l(i, j)$, as the normalized mutual information (MI), $\tilde{I}(\cdot\,; \cdot)$, of the random variables $P_l$ and $Q_{l,i,j}$:

$$S_l(i, j) = \tilde{I}(P_l; Q_{l,i,j}). \tag{6}$$

Displacement (i, j) spans the entire $(k_l^x, k_l^y)$ receptive field of the convolution kernel, as demonstrated in Figure 3 (a). We arrange elements of $S_l(i, j)$ into a spatial dependence matrix as illustrated in Figure 3 (b). Intuitively, spatial dependency within pixels results in statistical dependencies in pixel values, which can be detected via mutual information. $\tilde{I}(\cdot)$ is bounded between 0, when variables are independent, and 1, with complete mutual dependence. We calculate the normalized MI from the Shannon entropy $H(\cdot)$ of the random variables:

$$\tilde{I}(P_l; Q_{l,i,j}) = \frac{H(P_l) + H(Q_{l,i,j}) - H(P_l, Q_{l,i,j})}{H(P_l, Q_{l,i,j})}, \tag{7}$$

where $H(P_l, Q_{l,i,j})$ is the joint entropy of $P_l$ and $Q_{l,i,j}$. We calculate the entropy by estimating the distributions of $P_l$ and $Q_{l,i,j}$ through discrete binning. High-resolution images often contain redundant or spatially repeated neighbor pixels that may cause mutual information to overestimate the amount of useful learnable spatial dependence. To account for this, on ImageNet, we remove occurrences of $P_l$ with $Q_{l,i,j}$ that are near in pixel values when we estimate their distributions. In practice, a single image is not large enough to generate accurate distributions for $P_l$ and $Q_{l,i,j}$, so we aggregate pixels and their (i, j)-neighbor values over multiple batches of training data. Due to the inexpensive nature of the mutual information computation, we can afford to perform this calculation for each convolutional layer and every couple of epochs.

To get the spatial gradient scaling, $G_l$, we transform the spatial dependency matrix $S_l$ with an element-wise transform parameterized by a hyperparameter k:

$$G_l = \frac{k \times S_l}{(k - 1) S_l + 1}, \tag{8}$$

where k converts mutual information values into effective gradient scalings. Finally, we normalize the mean value of $G_l$. An overview of the SGS framework is given as pseudo-code in Appendix A.5, and details can be found in the corresponding open-source code (https://github.com/Ascend-Research/Reparameterization).

In short, our scaling defines how significant weight elements are for feature extraction based on their spatial location within the kernel. In order to determine the scaling, we use mutual information to measure the notion of spatial dependence between pixels a distance apart. We give high learning priority to the elements with a large spatial dependence on the center kernel element.
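The computation of the spatial dependence matrix, the k-transform, and the mean normalization can be sketched with histogram-based entropy estimates. The code below is a simplified illustration and not the released implementation: the number of bins, the synthetic smoothed feature map, the handling of the degenerate zero-entropy case, and all function names are assumptions, and the ImageNet-specific removal of near-identical pixel pairs is omitted. The default `k=5.0` follows the later observation that performance curves peak near k = 5.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in nats) of a histogram of counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def normalized_mi(p_vals, q_vals, bins=16):
    """(H(P) + H(Q) - H(P, Q)) / H(P, Q), estimated through discrete binning."""
    joint, _, _ = np.histogram2d(p_vals, q_vals, bins=bins)
    h_joint = entropy(joint)
    if h_joint == 0.0:
        return 0.0  # degenerate constant input; this convention is an assumption
    h_p = entropy(joint.sum(axis=1))
    h_q = entropy(joint.sum(axis=0))
    return (h_p + h_q - h_joint) / h_joint

def spatial_dependence(fmap, kx=3, ky=3, bins=16):
    """Spatial dependence matrix S of shape (kx, ky) for fmap of shape (n, h, w)."""
    _, h, w = fmap.shape
    rx, ry = kx // 2, ky // 2
    # Keep only pixels whose every (i, j)-neighbor lies inside the map.
    center = fmap[:, rx:h - rx, ry:w - ry].ravel()
    s = np.zeros((kx, ky))
    for a, i in enumerate(range(-rx, rx + 1)):
        for b, j in enumerate(range(-ry, ry + 1)):
            neighbor = fmap[:, rx + i:h - rx + i, ry + j:w - ry + j].ravel()
            s[a, b] = normalized_mi(center, neighbor, bins)
    return s

def k_transform(s, k=5.0):
    """Map spatial dependence to a gradient scaling and normalize its mean to 1."""
    g = k * s / ((k - 1.0) * s + 1.0)
    return g / g.mean()

# Synthetic, spatially correlated "feature maps": white noise smoothed by a 3 x 3 box sum.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 32, 32))
fmap = sum(np.roll(base, (di, dj), axis=(1, 2))
           for di in (-1, 0, 1) for dj in (-1, 0, 1))

S = spatial_dependence(fmap)
G = k_transform(S, k=5.0)
```

The zero displacement compares each pixel with itself, so the center of `S` equals 1 and the center of `G` is always the largest entry; how quickly the remaining entries fall off depends on how spatially correlated the feature map is.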

4. EXPERIMENTS AND RESULTS

In this section, we assess the effectiveness of spatial gradient scaling in improving model generalization ability. Following convention (Huang et al. (2022) 

4.1. CIFAR

We train VGG-16 on CIFAR-{10,100} for 600 epochs with a batch size of 128, a cosine annealing scheduler with an initial learning rate of 0.1, and an SGD optimizer with momentum 0.9 and weight decay 1 × 10^-4. We update our spatial gradient scalings every 30 epochs using 20 random batches from the training set. We add a 1-epoch warm-up period at the start of training before generating our gradient scalings. Results are shown in Table 3. Additional results are available in Appendix A.1.

Our framework uses a single hyperparameter k, which defines a functional mapping between mutual information and spatial gradient scaling. We search for k on CIFAR-100 and use the optimal value for experiments on CIFAR-10 and ImageNet. We perform a grid search on CIFAR-100 and VGG-16 over k ∈ {2, 3, 4, 5, 6, 7}.

Our spatial gradient scaling performs as well as or better than state-of-the-art reparameterization methods at a fraction of their cost. On CIFAR-10, our spatial gradient scaling outperforms DBB and performs as well as DyRep while only requiring a third of their training time. On CIFAR-100, we obtain over 1% accuracy improvement over DBB and DyRep while taking less than half their GPU hours. We attribute the success of spatial gradient scaling to its enhanced reparameterization space and strategy. Unlike structural reparameterization methods like DBB and DyRep, we are not limited by the computational complexity of our blocks, which enables us to explore a much larger space of reparameterizations. Our formalism of spatial gradient scaling as an equivalent to reparameterization also enables us to perform a search in an easily implementable continuous space as opposed to a discrete structural one. Our mutual information strategy adaptively reparameterizes each convolution throughout the training process efficiently and effectively (as we show in Section 4.3).

4.2. IMAGENET

We train the ResNet models for 120 epochs, with a batch size of 256, a cosine annealing scheduler with an initial learning rate of 0.1, color jitter augmentation, and SGD with a momentum of 0.1 and weight decay 1 × 10^-4. Scalings update every 5 epochs using two random training batches after a one-epoch model warm-up. Hyperparameter k is taken as the optimal value found from the CIFAR-100 grid search. Results are presented in Table 2. As with CIFAR, we see significantly reduced training times with our spatial gradient scaling method for equal or better accuracy compared to the state-of-the-art reparameterization methods. The benefits of reparameterization are attained without complicating the model structure with expensive reparameterization blocks.

Large convolution kernels, like the 7 × 7 used in ResNet, are difficult to structurally reparameterize. First, these large convolutions are expensive in terms of compute and memory. The addition of a parallel reparameterization branch only increases their computational cost further. Second, as convolution size increases, so does the number of possible diverse receptive fields (for example, the receptive fields of a 7 × 7 convolution are: 1 × 1, 1 × 3, ..., 3 × 5, ...). DBB and DyRep can only consider a tiny fraction of the possible set (7 × 7, 1 × 7, 7 × 1, 1 × 1). Our spatial gradient scaling can consider all possible receptive fields, even non-standard ones, through our general binary masks. In addition, our method completely avoids the computational pitfalls of structural reparameterization, as our reparameterization happens efficiently on the gradient level.

Table 3: Results of VGG-11 on CIFAR-100 using different spatial gradient scaling search methods.

4.3. ABLATION STUDIES

Effectiveness of Mutual Information Approach. We demonstrate our mutual information approach's efficacy in finding high-performance spatial gradient scalings. We compare to autocorrelation, another commonly used dependency measure, as well as a grid search for gradient scalings. We train all methods on CIFAR-100 with VGG-11 for 200 epochs with a batch size of 512, a cosine annealing scheduler with an initial learning rate of 0.1, and an SGD optimizer with momentum 0.9 and weight decay 5 × 10^-4. For autocorrelation and mutual information, we update our spatial gradient scalings every 10 epochs using 2 random batches from the training set and a warm-up of 1 epoch.

Similar to mutual information, we can measure spatial dependencies using autocorrelation. Specifically, we calculate the correlation of a feature map with itself shifted by (i, j), where (i, j) indexes into a spatial dependency matrix (Figure 3). We use the same k transform to map autocorrelation values into gradient scalings. We perform a grid search over k ∈ {1, 2, 3, 4, 5, 6} using 20% of the training set for validation. For mutual information, we use the optimal k found in Section 4.

We take several considerations for grid search over spatial gradient scaling to make the search tractable. First, we reduce the search space by considering a single 3 × 3 gradient scaling shared by all convolutional layers and constant for all training epochs. We additionally parameterize the scaling matrix by two variables, α and β, which determine the ratio of the center element scaling to the edges and corners respectively (shown in Appendix 7). We search over α, β ∈ {0.8, 1.0, 1.25, 1.7, 5.0, 10, 100}.

Our mutual information approach (SGS) outperforms both autocorrelation and grid search. While grid search is robust, it suffers from high computational complexity, which requires designing a constrained search space. Unlike SGS, grid search cannot effectively adapt across convolution depth and time without an exponential blowup in the search space. While autocorrelation outperforms grid search, our mutual information method performs better. We attribute this to the fact that autocorrelation can only measure linear relationships in the feature map, while mutual information can measure both linear and non-linear dependencies.

k-Transformation Search. In this section, we investigate the behavior of the k hyperparameter across models and datasets. Additionally, we corroborate our decision in Section 4 to learn k once on CIFAR-100 and transfer it to ImageNet. Following the training strategy defined in Section 4, we train and evaluate models over a range of k values and plot the results in Appendix Figure 8. We observe that k curves across models and datasets peak near k = 5. This may imply that k = 5 is a robust default value. We also find consistent performance gains over the baseline for a large range of k values. This implies that the conventional uniform update of weights is suboptimal, and lower testing error can be attained via spatial gradient scaling.
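The constrained grid-search space over a shared 3 × 3 scaling can be written down compactly. The sketch below shows one plausible reading of the two-variable parameterization, assuming α is the ratio of the center scaling to the edge scaling and β the ratio of the center scaling to the corner scaling, followed by mean normalization; the exact form used in the paper's appendix may differ, and `cross_scaling` is a hypothetical name.

```python
import itertools
import numpy as np

def cross_scaling(alpha, beta):
    """3 x 3 scaling with center/edge ratio alpha and center/corner ratio beta, mean 1."""
    e, c = 1.0 / alpha, 1.0 / beta
    g = np.array([[c, e, c],
                  [e, 1.0, e],
                  [c, e, c]])
    return g / g.mean()

# Candidate values for alpha and beta as listed in the text.
values = [0.8, 1.0, 1.25, 1.7, 5.0, 10, 100]
grid = {(a, b): cross_scaling(a, b) for a, b in itertools.product(values, values)}
```

Even with one scaling shared across all layers and epochs, this yields 49 candidate matrices that each need a full training run, which illustrates why letting the scaling vary per layer or over time would blow up the search space.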

5. CONCLUSION

In this paper, we present Spatial Gradient Scaling (SGS), an approach that improves the generalization of neural networks by changing the learning dynamics to focus on spatially important weights. We achieve this by scaling the convolution gradients adaptively from the spatial dependencies of feature maps. We propose a mutual information-based approach to compute the gradient scaling with minimum overhead to the original training routine. We prove that our SGS is equivalent to convolutional reparameterization under certain conditions. This enables us to take advantage of the benefits of reparameterization without introducing complex branching into model structures. Experiments show that our method outperforms the state-of-the-art structural reparameterization approaches on several image classification models and datasets at a much lower computational cost.

Table 4: Results for models trained on CIFAR-100 with and without spatial gradient scaling. Results are averaged over 3 independent replicas.

A.2 COMPARISONS TO ADAPTIVE GRADIENT OPTIMIZERS

In this section, we compare spatial gradient scaling to popular adaptive gradient optimizers. Unlike adaptive optimizers, which typically optimize based on the model weights and gradients of previous timesteps, our gradient scaling uses the spatial properties within the training data to effectively scale convolution gradients. In Table 5, we present CIFAR-100 results for various optimizers with and without spatial gradient scaling. We tune optimizer hyperparameters with a grid search using 20% of the training data for validation. We focus our search on learning rate and weight decay, leaving other optimizer settings as PyTorch defaults. Training and SGS settings, outside of searched optimizer hyperparameters, are identical to those described in Appendix A.1. We find that spatial gradient scaling improves the performance of even the highest-performing optimizer, SGD + Momentum. Moreover, all tested optimizers, with Adagrad being the only exception, benefited from spatial gradient scaling. Even in Adagrad's case, we can find a performance gain by applying gradient scaling after the optimizer calculations and right before the weight update step (shown as Adagrad*), as opposed to applying it directly to the back-propagated gradient before the optimizer calculations.
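The difference between scaling before and after the optimizer calculations can be illustrated with a hand-written Adagrad step. The sketch below is an illustration only and offers one possible reading of the Adagrad result that the source does not itself state: Adagrad divides each gradient element by the root of its accumulated squared gradients, so a constant per-element scaling applied beforehand cancels out almost exactly. The update follows the standard Adagrad rule; the fixed random gradient sequence, the `adagrad_step` helper, and the example scaling are assumptions.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.01, eps=1e-10, pre_scale=None, post_scale=None):
    """One Adagrad step with optional scaling before or after the optimizer math."""
    if pre_scale is not None:
        grad = grad * pre_scale          # scale the back-propagated gradient
    accum = accum + grad ** 2            # running sum of squared gradients
    update = grad / (np.sqrt(accum) + eps)
    if post_scale is not None:
        update = update * post_scale     # scale right before the weight update
    return w - lr * update, accum

rng = np.random.default_rng(0)
G = np.array([[1.0, 2.0, 1.0],
              [2.0, 3.0, 2.0],
              [1.0, 2.0, 1.0]])
G = G / G.mean()
grads = rng.normal(size=(10, 3, 3))      # fixed stand-in gradient sequence
w0 = rng.normal(size=(3, 3))

def run(pre_scale=None, post_scale=None):
    w, accum = w0.copy(), np.zeros_like(w0)
    for g in grads:
        w, accum = adagrad_step(w, g, accum, pre_scale=pre_scale, post_scale=post_scale)
    return w

w_base = run()
w_pre = run(pre_scale=G)     # scaling applied before Adagrad's normalization
w_post = run(post_scale=G)   # scaling applied after it, as in the Adagrad* variant

pre_scaling_cancels = np.allclose(w_pre, w_base)
post_scaling_changes = not np.allclose(w_post, w_base)
```

In this toy setting the pre-scaled run is numerically indistinguishable from the unscaled one, while the post-scaled run differs, which is consistent with the observation that Adagrad only benefits when the scaling is applied after the optimizer calculations.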

A.3 SENSITIVITY TO DIFFERENT TRAINING SETTINGS

In this section, we empirically study the performance improvement from spatial gradient scaling across training hyperparameters. Results are shown in Table 6. Training and SGS settings, outside of those in the ablation study, are identical to Appendix A.1. We find performance improvements over the baseline under all tested hyperparameter configurations. This suggests that SGS can be used effectively to improve ConvNet training without requiring extensive hyperparameter tweaking.

A.6 LEARNED SPATIAL GRADIENT SCALING

In Figure 4, we plot the spatial gradient scalings for the first, seventh, and last convolutional layers of VGG-16 at the start and the end of its training on CIFAR-100. We observe more uniformly distributed gradient scaling for beginning layers, giving equal relative importance to all kernel weights. Deeper layers have center-focused gradient scalings, with most of the importance distributed on the center and edge elements as opposed to the corners. Interestingly, we observe that after training, spatial gradient scalings for deeper layers become even more center-focused, indicating a decrease in the spatial dependencies of the feature map.

We plot gradient scalings for ResNet18 on ImageNet in Figure 5. Like CIFAR, we see uniform spatial gradient scalings in early layers and center-focused scaling in deeper layers. Contrary to CIFAR, however, we find that gradient scalings "smoothen" over time and become more uniform. Discrepancies between the behavior of spatial gradient scaling on CIFAR and ImageNet warrant future investigation.

[...] 2019), we investigate the average kernel magnitude matrix of learned weights with and without our spatial gradient scaling. For a convolution weight, the average kernel magnitude matrix is defined as the mean of the absolute value of the weight tensor across the input and output channels (leaving the spatial channels intact). We further normalize the mean of the matrix for meaningful comparisons. [...] trained kernel magnitude matrix to focus more on the center and edge elements. We also qualitatively observe the similarities between the shape of spatial gradient scaling and the final trained weights.
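The average kernel magnitude matrix defined above is straightforward to compute. The sketch below assumes a weight tensor laid out as (c_out, c_in, k_x, k_y) and uses synthetic weights with an artificially emphasized center purely for demonstration; the function name is hypothetical.

```python
import numpy as np

def kernel_magnitude_matrix(weight):
    """Mean |w| over output and input channels, normalized to have mean 1.

    weight: array of shape (c_out, c_in, k_x, k_y); returns shape (k_x, k_y).
    """
    mag = np.abs(weight).mean(axis=(0, 1))
    return mag / mag.mean()

# Synthetic example: random weights whose center position is made three times larger.
rng = np.random.default_rng(0)
weight = rng.normal(size=(16, 8, 3, 3))
weight[:, :, 1, 1] *= 3.0

magnitude = kernel_magnitude_matrix(weight)
```

Because both this matrix and the spatial gradient scaling are normalized to mean 1 and share the kernel's spatial shape, they can be compared entry by entry.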

A.8 PROOF FOR EQUIVALENCE TO REPARAMETERIZATION

In this section we prove Lemma 1 by induction for optimizers that are linear functions of current and past weights and gradients, i.e., update rules of the form

$$W_l^{(t+1)} = f(\cdot) = \sum_{\tau \le t} \left( \gamma^{(\tau)} W_l^{(\tau)} + \zeta^{(\tau)} \frac{\partial L}{\partial W_l^{(\tau)}} \right),$$

where $W_l^{(t)}$ and $\partial L / \partial W_l^{(t)}$ are the weights and respective gradients for the $l$-th convolutional layer at training iteration $t$, and $f(\cdot)$ is a linear optimizer parameterized by arbitrary $\gamma^{(\tau)}$ and $\zeta^{(\tau)}$.

We begin by defining two models: a base convolutional layer and its $N$-branch reparameterization, which at $t = 0$ has an identical mapping:

$$Y_l = W_l^{(0)} * X_l = \sum_{n=1}^{N} \left( M_{l,n} \odot W_{l,n}^{(0)} \right) * X_l = \left( \sum_{n=1}^{N} M_{l,n} \odot W_{l,n}^{(0)} \right) * X_l$$

Our goal is to find a modified update rule for the single convolution such that the mappings are identical throughout training, i.e., $\forall (t \ge 0)$ the following holds:

$$W_l^{(t)} = \sum_{n=1}^{N} M_{l,n} \odot W_{l,n}^{(t)} \qquad (9)$$

Assume equation 9 holds $\forall (t \le t_0)$ (we know this is the case for $t_0 = 0$). We wish to find an update rule for the single convolution such that equation 9 holds for $t = t_0 + 1$ and thus $\forall (t \le (t_0 + 1))$. Together with our assumption of identical mapping at $t = 0$, we can then ensure equation 9 holds $\forall (t \ge 0)$. We then show that when equation 9 holds for $t$, then: $\frac{\partial L}{\partial W} \cdots$

First we demonstrate that the convolution gradient is only a function of the input tensor, $X^{(t)}$, and the output gradient, $\partial L / \partial Y^{(t)}$, and not of the weight $W$. We make use of the tensor index and Einstein summation notation:

$$Y^{(t)} = W^{(t)} * X^{(t)}$$

$$Y(c_o, h, w)^{(t)} = \sum_{c_i=0}^{C_i - 1} \sum_{k_h=0}^{K_H - 1} \sum_{k_w=0}^{K_W - 1} W(c_o, c_i, k_h, k_w)^{(t)} \, X(c_i, h + k_h, w + k_w)^{(t)}$$



https://github.com/Ascend-Research/Reparameterization
https://github.com/hszhao/semseg
https://github.com/xingyizhou/pytorch-pose-hg-3d



Figure 1: Overview of the framework for learning spatial gradient scalings. In (a), we show the kernel receptive field along with its elements and their associated pixel distances to the center. We generate a discrete average spatial dependence vs. pixel distance function (c) from the input feature map. Using (c) and the pixel distances in (a), we generate the spatial gradient scaling (d). Note that we simplify the process in the figure by considering pixel distance instead of displacement.

Figure 2: An illustration of the equivalence between SGS and structural reparameterization. The base 3x3 convolution, (a), is structurally reparameterized into (b), a 3-branched reparameterization with diverse receptive fields. (c) shows the equivalent binary masked convolutions. (d) shows the equivalent spatial gradient scaling, which is the sum over the binary masks. The gradient is elementwise multiplied by the spatial gradient scaling before passing to the optimizer.
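As a hedged illustration of the equivalence in Figure 2, the following NumPy sketch trains a 3-branch masked reparameterization and a single convolution with spatial gradient scaling side by side under plain SGD. The specific masks (3×3, 1×3, 3×1), the squared-error loss, the learning rate, and the helper names `conv2d_valid` and `kernel_grad` are assumptions chosen for the example and are not taken from the paper's code.

```python
import numpy as np

def conv2d_valid(x, w):
    """Single-channel 'valid' cross-correlation of image x with kernel w."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            y += w[i, j] * x[i:i + out_h, j:j + out_w]
    return y

def kernel_grad(x, grad_y):
    """dL/dW of conv2d_valid: depends only on the input x and on dL/dY."""
    out_h, out_w = grad_y.shape
    kh, kw = x.shape[0] - out_h + 1, x.shape[1] - out_w + 1
    g = np.zeros((kh, kw))
    for i in range(kh):
        for j in range(kw):
            g[i, j] = np.sum(grad_y * x[i:i + out_h, j:j + out_w])
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))        # toy input feature map
target = rng.normal(size=(6, 6))   # toy regression target

# Binary masks of the assumed branches: 3x3, 1x3 (horizontal), 3x1 (vertical).
masks = np.array([
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
    [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
], dtype=float)
sgs = masks.sum(axis=0)            # spatial gradient scaling = sum of masks

branches = rng.normal(size=(3, 3, 3))        # one 3x3 weight per branch
w_single = (masks * branches).sum(axis=0)    # identical mapping at t = 0

lr = 1e-3
for _ in range(20):
    # Branched model: each branch receives the merged-kernel gradient, masked.
    w_merged = (masks * branches).sum(axis=0)
    grad_y = conv2d_valid(x, w_merged) - target  # dL/dY for L = 0.5*||Y - target||^2
    branches -= lr * masks * kernel_grad(x, grad_y)

    # Single convolution: the gradient is scaled elementwise by sgs before SGD.
    grad_y_single = conv2d_valid(x, w_single) - target
    w_single -= lr * sgs * kernel_grad(x, grad_y_single)

max_gap = np.abs(w_single - (masks * branches).sum(axis=0)).max()
print(sgs)
print(max_gap)
```

Under these assumptions the merged branch weights and the single scaled convolution should differ only by floating-point round-off, so `max_gap` serves as a quick sanity check of the equivalence for this toy setting.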

, Ding et al. (2021b)), we compare test accuracies for ResNet and VGG models trained under state-of-the-art reparameterization schemes. We adopt the model training code and strategy from Huang et al. (2022).

Figure 4: Spatial gradient scaling for the first, seventh, and last convolutional layers of VGG-16 on CIFAR-100 at the beginning and end of training.

Figure 5: Spatial gradient scaling for the first, eighth, and last convolutional layers of ResNet18 on ImageNet at the beginning and end of training. The first layer is a 7 × 7 convolution.

Figure 6 depicts the average kernel magnitude matrix for the first, seventh, and last convolutional layers of VGG-16 trained on the CIFAR-100 dataset with and without spatial gradient scaling. We additionally show the spatial gradient scaling of the last training epoch of the SGS training scheme. Similar to Ding et al. (2019), we observe that our spatial gradient scaling modifies the normally

Published as a conference paper at ICLR 2023

Assuming equation 9 holds $\forall (t \le t_0)$, we find the update equation for the merged weights $W_l^{(t_0+1)}$
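For the special case of plain SGD, this induction step can be written out as a short worked derivation. It is a sketch under stated assumptions rather than the paper's general argument: the learning rate $\eta$ is a symbol introduced here, the masks $M_{l,n}$ are taken to be binary, and the branch gradient is taken to equal the masked gradient of the merged kernel, which follows from the chain rule because each branch enters the output only through $M_{l,n} \odot W_{l,n}$.

```latex
\begin{align*}
% Branch gradient (chain rule through the mask), evaluated at iteration t_0:
\frac{\partial L}{\partial W_{l,n}^{(t_0)}}
  &= M_{l,n} \odot \frac{\partial L}{\partial W_l^{(t_0)}} \\
% Plain SGD step on every branch (eta is an assumed learning-rate symbol):
W_{l,n}^{(t_0+1)}
  &= W_{l,n}^{(t_0)} - \eta \, M_{l,n} \odot \frac{\partial L}{\partial W_l^{(t_0)}} \\
% Merge the branches and use equation 9 at t_0:
\sum_{n=1}^{N} M_{l,n} \odot W_{l,n}^{(t_0+1)}
  &= W_l^{(t_0)} - \eta \left( \sum_{n=1}^{N} M_{l,n} \odot M_{l,n} \right)
     \odot \frac{\partial L}{\partial W_l^{(t_0)}} \\
% Binary masks satisfy M \odot M = M:
  &= W_l^{(t_0)} - \eta \left( \sum_{n=1}^{N} M_{l,n} \right)
     \odot \frac{\partial L}{\partial W_l^{(t_0)}}
\end{align*}
```

Updating the single convolution with its gradient scaled elementwise by $\sum_{n} M_{l,n}$ therefore keeps equation 9 valid at $t_0 + 1$ in this SGD setting, which matches the scaling shown in Figure 2(d).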

Results for VGG-16 on CIFAR-10 and CIFAR-100 trained using the official implementation of DyRep (Huang et al., 2022). Training is done on a single NVIDIA Tesla V100 GPU. FLOPs and parameters are averaged across DyRep runs. Results marked with * are taken from the official DyRep paper, while the rest are our runs averaged over 5 independent replicas.

Results on the ImageNet dataset. We use the official implementation of DyRep (Huang et al., 2022) on 8 NVIDIA Tesla V100 GPUs. FLOPs and parameters are averaged across DyRep runs. Results marked with * are taken from the DyRep paper; the rest are our runs averaged over 3 seeds.

$\cdots \, X(c_i, h + k_h, w + k_w)^{(t)}$ $\qquad \cdots \, \delta_{k,c_o} \, \delta_{l,c_i} \, \delta_{m,k_h} \, \delta_{n,k_w} \, X \cdots$

availability

//github.com/Ascend-Research/Reparameterization.

annex

for e = 0, ..., (E - 1) do
    if e divisible by N_e then
        b ← randomly sample N_b batches from X
        Forward propagate b and assign the input feature map of convolution layer l to the set X_l
        for l in L do
            for (i, j) in kernel size of l do
                P, Q_(i,j) ← list of pixels of X_l and their corresponding (i, j)-th neighbours
                SGS_l,(i,j) ← mutual information between P and (i, j) neighbour Q_(i,j)
    for (x, y) in (X, Y) do
        Forward and backward propagate (x, y)
        for l in L do
            for (i, j) in kernel size of l do
                Scale the l-th convolution spatial (i, j) gradient element with SGS_l,(i,j)
        Update weights with gradients

Given equation 9 we can now show: which implies that: Finally, picking up where we left off: We arrive at Lemma 1.
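The scaling-estimation and gradient-scaling steps of the training procedure listed above can be sketched as follows. This is a minimal NumPy illustration under several assumptions: mutual information is estimated with a simple 2D histogram (8 bins), the feature map is a synthetic blurred-noise tensor, and the function names `mutual_information` and `estimate_sgs` are hypothetical. The paper's actual estimator and any normalization of the scalings may differ.

```python
import numpy as np

def mutual_information(p, q, bins=8):
    """Histogram estimate (in nats) of the mutual information between p and q."""
    joint, _, _ = np.histogram2d(p, q, bins=bins)
    joint = joint / joint.sum()
    p_marg = joint.sum(axis=1, keepdims=True)   # shape (bins, 1)
    q_marg = joint.sum(axis=0, keepdims=True)   # shape (1, bins)
    independent = p_marg @ q_marg               # product of the marginals
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])))

def estimate_sgs(feature_map, kernel_size=3, bins=8):
    """MI between every pixel and its (i, j)-th neighbour, per kernel position.

    feature_map: array of shape (C, H, W) (assumed layout).
    Returns a (kernel_size, kernel_size) matrix of raw MI values.
    """
    r = kernel_size // 2
    _, h, w = feature_map.shape
    center = feature_map[:, r:h - r, r:w - r].ravel()            # pixels P
    sgs = np.zeros((kernel_size, kernel_size))
    for i in range(kernel_size):
        for j in range(kernel_size):
            # Pixels shifted by (i - r, j - r): the (i, j)-th neighbours Q.
            neighbour = feature_map[:, i:i + h - 2 * r, j:j + w - 2 * r].ravel()
            sgs[i, j] = mutual_information(center, neighbour, bins)
    return sgs

# Synthetic feature map: 3x3 box-blurred noise, so nearby pixels are dependent.
rng = np.random.default_rng(0)
noise = rng.normal(size=(4, 34, 34))
feature_map = sum(noise[:, a:a + 32, b:b + 32] for a in range(3) for b in range(3)) / 9.0

sgs = estimate_sgs(feature_map)

# Scale a stand-in weight gradient of shape (C_out, C_in, K_H, K_W);
# the (K_H, K_W) scaling broadcasts over both channel dimensions.
grad = rng.normal(size=(16, 4, 3, 3))
scaled_grad = grad * sgs
print(np.round(sgs, 3))
```

In a training loop, `estimate_sgs` would be re-run every N_e epochs on sampled batches, and the elementwise product would be applied to each convolution's weight gradient before the optimizer step.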

