REPARAMETERIZATION THROUGH SPATIAL GRADIENT SCALING

Abstract

Reparameterization aims to improve the generalization of deep neural networks by transforming convolutional layers into equivalent multi-branched structures during training. However, there exists a gap in understanding how reparameterization may change and benefit the learning process of neural networks. In this paper, we present a novel spatial gradient scaling method to redistribute learning focus among weights in convolutional networks. We prove that spatial gradient scaling achieves the same learning dynamics as a branched reparameterization yet without introducing structural changes into the network. We further propose an analytical approach that dynamically learns scalings for each convolutional layer based on the spatial characteristics of its input feature map gauged by mutual information. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that without searching for reparameterized structures, our proposed scaling method outperforms the state-of-the-art reparameterization strategies at a lower computational cost.

1. INTRODUCTION

The ever-increasing performance of deep learning is largely attributed to progress in neural architecture design, with a trend of not only building deeper networks (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014) but also introducing complex blocks through multi-branched structures (Szegedy et al., 2015; 2016; 2017). Recently, efforts have been devoted to Neural Architecture Search, Network Morphism, and Reparameterization, which aim to strike a balance between network expressiveness, performance, and computational cost. Neural Architecture Search (NAS) (Elsken et al., 2018; Zoph & Le, 2017) searches for network topologies in a predefined search space, which often involves multi-branched micro-structures. Examples include the DARTS (Liu et al., 2019) and NAS-Bench-101 (Ying et al., 2019) search spaces, which span a large number of cell (block) topologies that are stacked together to form a neural network. In Network Morphism (Wei et al., 2016; 2017), a well-trained parent network is morphed into a child network with the goal of adapting it to a downstream application with minimal re-training. Morphism preserves the parent network's function and output while yielding child networks that are deeper and wider. Structural reparameterization (Ding et al., 2021c) branches and augments certain operations during training into an equivalent but more complex structure with extra learnable parameters. For example, the Asymmetric Convolution Block (ACB) (Ding et al., 2019) augments a regular 3 × 3 convolution with horizontal 1 × 3 and vertical 3 × 1 convolutions, such that training is performed on the reparameterized network and takes advantage of the changed learning dynamics. During inference, the trained reparameterized network is equivalently transformed back to its simple base structure, preserving the original low inference time while maintaining the boosted performance of the reparameterized model.
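The inference-time merge behind ACB-style reparameterization rests on the linearity of convolution and can be verified in a few lines. The sketch below (plain NumPy on a toy single-channel map; illustrative only, not the authors' implementation, which must also fold in batch normalization) embeds the 1 × 3 and 3 × 1 branch kernels into zero-padded 3 × 3 kernels and checks that the sum of the branch outputs equals a single convolution with the summed kernel:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2D cross-correlation of a single-channel map x with kernel k."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))      # toy single-channel feature map
k3 = rng.standard_normal((3, 3))     # main 3x3 branch
k_h = rng.standard_normal(3)         # horizontal 1x3 branch
k_v = rng.standard_normal(3)         # vertical 3x1 branch

# Embed the asymmetric kernels into zero-padded 3x3 kernels
# (their center row/column), as in the inference-time merge.
pad_h = np.zeros((3, 3)); pad_h[1, :] = k_h
pad_v = np.zeros((3, 3)); pad_v[:, 1] = k_v
k_merged = k3 + pad_h + pad_v

# Training-time branched output equals one convolution with the merged kernel.
y_branched = conv2d(x, k3) + conv2d(x, pad_h) + conv2d(x, pad_v)
y_merged = conv2d(x, k_merged)
print(np.allclose(y_branched, y_merged))  # True
```

Because both sides are linear in the kernel, the equality is exact up to floating-point rounding, which is why the branched structure can be collapsed after training with no change in the network's function.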
However, there exists a gap in understanding how and when the different learning dynamics of a reparameterized model help its training. In addition, searching for the optimal reparameterized structure over a discrete space (Huang et al., 2022) inevitably increases the computational cost of deep learning. In this paper, we propose Spatial Gradient Scaling (SGS), an approach that changes learning dynamics as reparameterization does, yet without introducing structural changes to the neural network. We examine the question: Can we achieve the same effect as branched reparameterization on convolutional networks without changing the network structure? Our proposed spatial gradient scaling learns a spatially varied scaling for the gradient of convolutional weights, which we prove has the same learning dynamics as branched reparameterization, without modifying the network structure. We further show that scaling gradients according to the spatial dependence of neighboring pixels in the input (or intermediate) feature maps can boost learning performance, without searching for the reparameterized form. Our main contributions can be summarized as follows:

• We investigate a new problem of spatially scaling gradients in the kernels of convolutional neural networks. This method enhances the performance of existing architectures solely by redistributing learning rates spatially, i.e., by adaptively strengthening or weakening the gradients of convolution weights according to their relative position in the kernel.

• We mathematically establish a connection between the proposed SGS and parallel convolutional reparameterization, and show their equivalence in learning. This enables an understanding of how existing multi-branched reparameterization structures help improve feature learning.
This interpretation also suggests an architecture-independent reparameterization method that directly induces the effect via scaled gradients, bypassing the need for complex structure augmentations and saving computational cost.

• We propose a lightweight method to compute gradient scalings for a given network and dataset based on the spatial dependencies in its feature maps. Specifically, we make novel use of the mutual information between neighboring pixels of a feature map within the receptive field of a convolution kernel to dynamically determine gradient scalings for each network layer, adding only minimal overhead to the original training routine. Extensive experiments show that the proposed data-driven spatial gradient scaling approach matches or outperforms state-of-the-art methods on several image classification models and datasets, with almost none of the extra computational cost and memory consumption during training that multi-branched reparameterization structures require.
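The two ideas above can be sketched in plain NumPy. All names below are hypothetical, and the mapping from mutual information to scalings is purely illustrative (the paper derives it analytically): a plug-in histogram estimate of the mutual information between each feature-map pixel and its shifted neighbor yields a 3 × 3 mask, which then rescales the convolution-weight gradient elementwise in an SGD step.

```python
import numpy as np

def mutual_info(a, b, bins=16):
    """Plug-in (histogram) estimate of the mutual information of two arrays."""
    h, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p = h / h.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal over rows
    py = p.sum(axis=0, keepdims=True)   # marginal over columns
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))

def mi_mask(f):
    """3x3 mask of MI between feature map f and each of its 8-neighbor shifts."""
    H, W = f.shape
    mask = np.zeros((3, 3))
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            a = f[max(0, di):H + min(0, di), max(0, dj):W + min(0, dj)]
            b = f[max(0, -di):H + min(0, -di), max(0, -dj):W + min(0, -dj)]
            mask[di + 1, dj + 1] = mutual_info(a, b)
    return mask

def sgs_step(w, g, scale, lr=0.1):
    """One SGD step with spatial gradient scaling.
    w, g: (out_ch, in_ch, K, K) conv weight and its gradient; scale: (K, K)."""
    return w - lr * g * scale  # broadcasting applies the scale per kernel position

rng = np.random.default_rng(0)
feat = rng.standard_normal((32, 32))    # toy input feature map
scale = mi_mask(feat)
scale = scale / scale.max()             # arbitrary normalization, for illustration

w = rng.standard_normal((4, 2, 3, 3))
g = rng.standard_normal((4, 2, 3, 3))
w_new = sgs_step(w, g, scale)
```

In the actual method, such a mask would be recomputed per layer from its input feature maps as training progresses; the normalization here is only a stand-in for the paper's analytical mapping from mutual information to scalings.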

2. RELATED WORK

2.1 MULTI-BRANCH STRUCTURES

VGG (Simonyan & Zisserman, 2014) is a base model for several computer vision tasks. Due to its limitations, several new structures with multiple branches have been proposed to achieve higher performance. GoogLeNet (Szegedy et al., 2015) and the Inception architectures (Szegedy et al., 2015; 2016; 2017) deploy multi-branch structures to enrich the learned feature space. ResNet (He et al., 2016) uses a simplified two-branch structure that adds the input of a layer to its output through residual connections. The improvements in top-1 ImageNet classification accuracy achieved with these structures demonstrate the importance of multiple receptive fields (e.g., 1 × 1, 1 × K, K × 1, and K × K convolutions), diverse layer connections, and combinations of parallel branches. These performance improvements often come at a computational cost, as complex model topologies are less hardware-friendly and have increased computational requirements. Beyond expertly designed networks, advances in Neural Architecture Search (NAS) allow network design to be automated. Several search spaces and discovered high-performing networks (Ying et al., 2019; Dong & Yang, 2020; Ding et al., 2021a) utilize multi-branch structures, which shows their ubiquity in modern convolutional architectures. Due to the enormous number of possible branched model topologies, search is often computationally expensive and requires vast computational resources.

2.2 STRUCTURAL REPARAMETERIZATION

Multi-branch structures enhance the performance of ConvNets. This comes at the cost of higher memory and computational power requirements, which is undesirable for inference-time applications.

Code availability: https://github.com/Ascend-Research/Reparameterization.

