LEARNING DEEPLY SHARED FILTER BASES FOR EFFICIENT CONVNETS

Abstract

Recently, inspired by the repetitive block structure of modern ConvNets, such as ResNets, parameter sharing among repetitive convolution layers has been proposed to reduce the number of parameters. However, naive sharing of convolution filters poses many challenges, such as overfitting and vanishing/exploding gradients, resulting in worse performance than non-shared counterpart models. Furthermore, sharing parameters often increases computational complexity due to additional operations for re-parameterization. In this work, we propose an efficient parameter-sharing structure and an effective training mechanism for recursive ConvNets. In the proposed ConvNet architecture, convolution layers are decomposed into a filter basis, which can be shared recursively, and non-shared layer-specific parts. We conjecture that a shared filter basis combined with a small number of layer-specific parameters can retain, or even enhance, the representation power of individual layers if a proper training method is applied. We show both theoretically and empirically that potential vanishing/exploding gradient problems can be mitigated by enforcing orthogonality on the shared filter bases. Experimental results demonstrate that our scheme effectively reduces redundancy, saving up to 63.8% of parameters while consistently outperforming non-shared counterpart networks even when a filter basis is shared by up to 10 repetitive convolution layers.

1. INTRODUCTION

Modern networks such as ResNets usually consist of massive stacks of identical convolution blocks, and recent analytic studies (Jastrzebski et al., 2018) show that these blocks perform similar iterative refinement rather than learning new features. Inspired by this repetitive block structure, recursive ConvNets that share weights across iterative blocks have been studied as a promising direction toward parameter-efficient ConvNets (Jastrzebski et al., 2018; Guo et al., 2019; Savarese & Maire, 2019). However, repeated use of parameters across many convolution layers incurs several challenges that limit the performance of such recursive networks. First, deep sharing of parameters can cause the vanishing and exploding gradient problems often found in recurrent neural networks (RNNs) (Pascanu et al., 2013; Jastrzebski et al., 2018). Another challenge is that the overall representation power of the network can be limited by using the same filters repeatedly across many convolution layers.

To address these challenges, in this paper we propose an effective and efficient parameter-sharing mechanism for modern ConvNets with many repetitive convolution blocks. In our work, convolution filters are decomposed into a fundamental and reusable unit, called a filter basis, and a layer-specific part, called coefficients. By sharing a filter basis, rather than whole convolution filters or layers, we can impose two desirable properties on the shared parameters: (1) resilience against vanishing/exploding gradients, and (2) representational expressiveness of the individual layers that share parameters. We first show theoretically that a shared filter basis can cause vanishing and exploding gradient problems, and that these problems can be controlled to a large extent by making filter bases orthogonal.
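The decomposition described above can be sketched in a few lines of numpy: each layer's filters are linear combinations of a small shared basis of kernels, weighted by layer-specific coefficients. This is a minimal illustration with made-up shapes and names, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared filter basis: `num_basis` 3x3 kernels reused by every layer.
num_basis, k = 8, 3
basis = rng.standard_normal((num_basis, k, k))

def build_filters(coeffs):
    """Construct layer-specific filters as linear combinations of the
    shared basis: filters[o, i] = sum_b coeffs[o, i, b] * basis[b]."""
    return np.einsum('oib,bhw->oihw', coeffs, basis)

# Two layers share one basis but keep their own small coefficient tensors.
out_ch, in_ch = 16, 16
coeffs_l1 = rng.standard_normal((out_ch, in_ch, num_basis))
coeffs_l2 = rng.standard_normal((out_ch, in_ch, num_basis))

filters_l1 = build_filters(coeffs_l1)   # shape (16, 16, 3, 3)
filters_l2 = build_filters(coeffs_l2)

# Parameter count: shared basis + per-layer coefficients vs. full filters.
shared = basis.size + coeffs_l1.size + coeffs_l2.size
full = 2 * out_ch * in_ch * k * k
print(shared, full)  # sharing is cheaper whenever num_basis < k * k
```

The saving grows with the number of layers sharing the basis, since only the small coefficient tensors are duplicated per layer.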
To enforce the orthogonality of filter bases, we propose an orthogonality regularization for training ConvNets with deeply shared filter bases. Our experimental results show that the proposed orthogonality regularization reduces redundancy not just in deeply shared filter bases, but also in non-shared parameters, resulting in better performance than over-parameterized counterpart networks. Next, we make convolution layers with shared parameters more expressive using a hybrid approach to sharing filter bases, in which a small number of layer-specific non-shared filter basis components are combined with the shared filter basis components. With this hybrid scheme, the constructed filters can be positioned in different vector subspaces that reflect the peculiarities of individual convolution layers. We argue that these layer-specific variations increase the representation power of the network when a large portion of parameters is shared.

Since our focus is not on pushing state-of-the-art performance, we validate our approach using widely-used ResNets as base models on image classification tasks with the CIFAR and ImageNet datasets. Our experimental results demonstrate that when each filter basis is shared by up to 10 convolution layers, our method consistently outperforms counterpart ConvNet models while reducing a significant amount of parameters and computational cost. For example, our method saves up to 63.8% of parameters and 33.4% of FLOPs while achieving lower test errors than much deeper counterpart models. Our parameter-sharing structure and training mechanism can also be applied to modern compact networks, such as MobileNets (Howard et al., 2017) and ShuffleNets (Zhang et al., 2018), with minor adaptations. Since these compact models already have decomposed convolution blocks, part of each block can be identified as a shareable filter basis and the rest as layer-specific parts.
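One common way to realize such an orthogonality regularization is a soft penalty on the Gram matrix of the flattened basis, ||BBᵀ − I||²_F. The sketch below illustrates this idea in numpy; the paper's exact regularizer and hyperparameters may differ.

```python
import numpy as np

def orth_penalty(basis):
    """Soft orthogonality penalty ||B B^T - I||_F^2 on a shared basis,
    where each row of B is one flattened basis filter."""
    B = basis.reshape(basis.shape[0], -1)   # (num_basis, k*k)
    gram = B @ B.T
    return np.sum((gram - np.eye(B.shape[0])) ** 2)

rng = np.random.default_rng(0)
random_basis = rng.standard_normal((4, 3, 3))

# Build an exactly orthonormal basis via QR; its penalty is (near) zero.
q, _ = np.linalg.qr(rng.standard_normal((9, 4)))
orth_basis = q.T.reshape(4, 3, 3)

print(orth_penalty(random_basis) > orth_penalty(orth_basis))  # True
```

In training, this penalty would be added to the task loss with a small weight, pushing the shared basis filters toward mutual orthogonality.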
In experiments, we demonstrate that the compact MobileNetV2 can achieve a further 8-21% parameter saving with our scheme while retaining, or improving, the performance of the original models.
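The hybrid scheme from the introduction, in which each layer augments the shared basis with a few private basis components, can be sketched as follows. The shapes and the shared/private split are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_shared, num_private, k = 6, 2, 3

# One basis shared across layers, plus a few private components per layer,
# so each layer's filters live in a slightly different subspace.
shared_basis = rng.standard_normal((num_shared, k, k))

def layer_filters(private_basis, coeffs):
    """Filters drawn from the span of [shared_basis; private_basis]."""
    full_basis = np.concatenate([shared_basis, private_basis], axis=0)
    return np.einsum('oib,bhw->oihw', coeffs, full_basis)

out_ch, in_ch = 8, 8
priv1 = rng.standard_normal((num_private, k, k))
coeff1 = rng.standard_normal((out_ch, in_ch, num_shared + num_private))
f1 = layer_filters(priv1, coeff1)
print(f1.shape)  # (8, 8, 3, 3)
```

Only the small private components and coefficients are duplicated per layer; the bulk of the basis remains shared.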

2. RELATED WORK

Recursive networks and parameter sharing: Recurrent neural networks (RNNs) (Graves et al., 2013) have been well studied for temporal and sequential data. As a generalization of RNNs, recursive variants of ConvNets have been used extensively for visual tasks (Socher et al., 2011; Liang & Hu, 2015; Xingjian et al., 2015; Kim et al., 2016; Zamir et al., 2017). For instance, Eigen et al. (2014) explore recursive convolutional architectures that share filters across multiple convolution layers. They show that recurrence with deeper layers tends to increase performance; however, their recursive architecture performs worse than independent convolution layers due to overfitting. In most previous works, the filters themselves are shared across layers. In contrast, we propose to share filter bases, which are more fundamental and reusable building blocks for constructing layer-specific filters. More recently, Jastrzebski et al. (2018) show that the iterative refinement of features in ResNets suggests that deep networks can potentially leverage intensive parameter sharing. Guo et al. (2019) introduce a gate unit that determines whether to jump out of the recursive loop of convolution blocks to save computational resources. These works show that training recursive networks with naively shared blocks leads to poor performance due to exploding and vanishing gradients, as in RNNs (Pascanu et al., 2013; Vorontsov et al., 2017). To mitigate this problem, they suggest an unshared batch normalization strategy. In our work, we propose an orthogonality regularization of shared filter bases to further address this problem. Savarese & Maire (2019)'s work is also relevant to ours: in their work, the parameters of recurrent layers of ConvNets are generated by a linear combination of 1-2 parameter tensors from a global bank of templates.
Though similar in spirit, our work proposes more fine-grained filter bases as more desirable building blocks for effective parameter sharing, since filter bases can easily be combined with layer-specific non-shared components for better representation power. Our results show that these layer-specific non-shared components are critical to achieving high performance. Although they achieve about 60% parameter savings, their approach does not outperform counterpart models and incurs slight increases in computational cost due to the overhead of re-parameterizing tensors from the templates.

Model compression and efficient convolution block design: Reducing the storage and inference time of ConvNets has been an important research topic for both resource-constrained mobile/embedded systems and energy-hungry data centers. A number of techniques have been developed, such as filter pruning (LeCun et al., 1990; Polyak & Wolf, 2015; Li et al., 2017; He et al., 2017), low-rank factorization (Denton et al., 2014; Jaderberg et al., 2014), quantization (Han et al., 2016), and knowledge distillation (Hinton et al., 2015; Chen et al., 2017), to name a few. These compression techniques have been suggested as post-processing steps applied after initial training; unfortunately, their accuracy is usually bounded by that of the approximated original models. By contrast, our models are trained from scratch, as in Ioannou et al. (2017)'s work, and our result shows that

