REVISITING DYNAMIC CONVOLUTION VIA MATRIX DECOMPOSITION

Abstract

Recent research in dynamic convolution shows substantial performance gains for efficient CNNs, due to the adaptive aggregation of K static convolution kernels. However, it has two limitations: (a) it increases the number of convolutional weights by K times, and (b) the joint optimization of dynamic attention and static convolution kernels is challenging. In this paper, we revisit dynamic convolution from a new perspective of matrix decomposition and reveal that the key issue is that dynamic convolution applies dynamic attention over channel groups after projecting into a higher-dimensional latent space. To address this issue, we propose dynamic channel fusion to replace dynamic attention over channel groups. Dynamic channel fusion not only enables a significant dimension reduction of the latent space, but also mitigates the joint optimization difficulty. As a result, our method is easier to train and requires significantly fewer parameters without sacrificing accuracy.

1. INTRODUCTION

Dynamic convolution (Yang et al., 2019; Chen et al., 2020c) has recently become popular for the implementation of light-weight networks (Howard et al., 2017; Zhang et al., 2018b). Its ability to achieve significant performance gains with negligible computational cost has motivated its adoption for multiple vision tasks (Su et al., 2020; Chen et al., 2020b; Ma et al., 2020; Tian et al., 2020). The basic idea is to aggregate multiple convolution kernels dynamically, according to an input-dependent attention mechanism, into a convolution weight matrix

$$W(x) = \sum_{k=1}^{K} \pi_k(x) W_k \quad \text{s.t.} \quad 0 \le \pi_k(x) \le 1, \quad \sum_{k=1}^{K} \pi_k(x) = 1,$$

where the K convolution kernels $\{W_k\}$ are aggregated linearly with attention scores $\{\pi_k(x)\}$. In this work, we revisit the two limitations of this formulation via matrix decomposition. To expose them, we reformulate dynamic convolution in terms of a set of residuals, re-defining the static kernels as


$$W_k = W_0 + \Delta W_k, \quad k \in \{1, \dots, K\},$$

where $W_0 = \frac{1}{K}\sum_{k=1}^{K} W_k$ is the average kernel and $\Delta W_k = W_k - W_0$ is a residual weight matrix. Further decomposing each residual with an SVD, $\Delta W_k = U_k S_k V_k^T$, leads to

$$W(x) = \sum_{k=1}^{K} \pi_k(x) W_0 + \sum_{k=1}^{K} \pi_k(x) U_k S_k V_k^T = W_0 + U \Pi(x) S V^T, \quad (3)$$

where $U = [U_1, \dots, U_K]$, $S = \mathrm{diag}(S_1, \dots, S_K)$, $V = [V_1, \dots, V_K]$, and $\Pi(x)$ stacks the attention scores diagonally as $\Pi(x) = \mathrm{diag}(\pi_1(x)I, \dots, \pi_K(x)I)$, with $I$ an identity matrix.

[Figure 1: Dynamic convolution via matrix decomposition. Left: reformulation of vanilla dynamic convolution by matrix decomposition (see Eq. 3), which applies dynamic attention $\Pi(x)$ over channel groups in a high-dimensional space ($SV^T x \in \mathbb{R}^{KC}$). Right: the proposed dynamic convolution decomposition, which applies dynamic channel fusion $\Phi(x)$ in a low-dimensional space ($Q^T x \in \mathbb{R}^L$, $L \ll C$), resulting in a more compact model.]

This decomposition, illustrated in Figure 1, shows that the dynamic behavior of $W(x)$ is implemented by the dynamic residual $U \Pi(x) S V^T$, which projects the input $x$ to a higher-dimensional space $SV^T x$ (from $C$ to $KC$ channels), applies dynamic attention $\Pi(x)$ over channel groups, and reduces the dimension back to $C$ channels through multiplication by $U$. This suggests that the limitations of vanilla dynamic convolution are due to the use of attention over channel groups, which induces a high-dimensional latent space, leading to small attention values that may suppress the learning of the corresponding channels. To address this issue, we propose the dynamic convolution decomposition (DCD), which replaces dynamic attention over channel groups with dynamic channel fusion.
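The identity $W(x) = \sum_k \pi_k(x) W_k = W_0 + U \Pi(x) S V^T$ (Eq. 3) is easy to check numerically. The sketch below is a minimal NumPy verification, assuming $1 \times 1$ kernels (so each $W_k$ is a $C \times C$ matrix) and random Dirichlet scores standing in for the attention $\pi_k(x)$; it builds both sides of Eq. 3 and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 6, 4
W = rng.standard_normal((K, C, C))   # static kernels W_k
pi = rng.dirichlet(np.ones(K))       # stand-in attention scores, sum to 1

# Residual reformulation: W_k = W_0 + dW_k
W0 = W.mean(axis=0)
dW = W - W0

# SVD of each residual: dW_k = U_k S_k V_k^T
U_blocks, S_blocks, Vt_blocks = [], [], []
for k in range(K):
    u, s, vt = np.linalg.svd(dW[k])
    U_blocks.append(u)
    S_blocks.append(np.diag(s))
    Vt_blocks.append(vt)

U = np.hstack(U_blocks)                        # C x KC
S = np.zeros((K * C, K * C))                   # block-diagonal S
for k in range(K):
    S[k * C:(k + 1) * C, k * C:(k + 1) * C] = S_blocks[k]
Vt = np.vstack(Vt_blocks)                      # KC x C
Pi = np.kron(np.diag(pi), np.eye(C))           # Pi(x) = diag(pi_1 I, ..., pi_K I)

lhs = np.einsum('k,kij->ij', pi, W)            # sum_k pi_k(x) W_k
rhs = W0 + U @ Pi @ S @ Vt                     # W_0 + U Pi(x) S V^T
assert np.allclose(lhs, rhs)
```

The check relies on $\sum_k \pi_k(x) = 1$, which turns $\sum_k \pi_k(x) W_0$ into $W_0$; it also makes concrete why the latent space of the vanilla formulation has $KC$ channels.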
Dynamic channel fusion is based on a full dynamic matrix $\Phi(x)$, each element $\phi_{i,j}(x)$ of which is a function of the input $x$. As shown in Figure 1 (right), the dynamic residual is implemented as the product $P \Phi(x) Q^T$ of $\Phi(x)$ and two static matrices $P$ and $Q$, such that $Q$ compresses the input into a low-dimensional latent space, $\Phi(x)$ dynamically fuses the channels in this space, and $P$ expands the number of channels to the output space. The key innovation is that dynamic channel fusion with $\Phi(x)$ enables a significant dimensionality reduction of the latent space ($Q^T x \in \mathbb{R}^L$, $L \ll C$). Hence the number of parameters in $P$ and $Q$ is significantly smaller than in $U$ and $V$ of Eq. 3, resulting in a more compact model. Dynamic channel fusion also mitigates the joint optimization challenge of vanilla dynamic convolution, as each column of $P$ and $Q$ is associated with multiple dynamic coefficients of $\Phi(x)$; a few dynamic coefficients of small value are therefore not sufficient to suppress the learning of the static matrices $P$ and $Q$. Experimental results show that DCD both significantly reduces the number of parameters and achieves higher accuracy than vanilla dynamic convolution, without requiring the additional constraints of Yang et al. (2019) or Chen et al. (2020c).
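The weight generator $W(x) = W_0 + P \Phi(x) Q^T$ can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: kernels are assumed to be $1 \times 1$ (so $C \times C$ matrices), and the branch producing $\Phi(x)$ is a hypothetical single linear map on globally pooled features. It also tallies the weight parameters against the vanilla formulation with $K$ kernels:

```python
import numpy as np

class DCDWeight:
    """Generate W(x) = W0 + P Phi(x) Q^T, with Phi(x) an L x L dynamic matrix."""

    def __init__(self, C, L, rng):
        assert L < C, "latent dimension L should be much smaller than C"
        self.W0 = rng.standard_normal((C, C))   # static average kernel
        self.P = rng.standard_normal((C, L))    # expands L -> C channels
        self.Q = rng.standard_normal((C, L))    # compresses C -> L channels
        # hypothetical branch producing the L*L dynamic coefficients of Phi(x)
        self.phi_proj = rng.standard_normal((L * L, C))

    def __call__(self, x_pooled):
        L = self.P.shape[1]
        Phi = (self.phi_proj @ x_pooled).reshape(L, L)  # dynamic channel fusion
        return self.W0 + self.P @ Phi @ self.Q.T

rng = np.random.default_rng(0)
C, K, L = 64, 4, 8                  # illustrative sizes, e.g. L = C // 8
layer = DCDWeight(C, L, rng)
W_x = layer(rng.standard_normal(C))

# Weight parameters only (the attention / Phi branches are excluded):
vanilla_params = K * C * C          # K static kernels W_k
dcd_params = C * C + 2 * C * L      # W0 plus the static matrices P and Q
```

With these sizes the static weights shrink from $KC^2 = 16{,}384$ to $C^2 + 2CL = 5{,}120$ parameters, which is the compactness argument of the text in miniature.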

2. RELATED WORK

Efficient CNNs: MobileNet (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019) decomposes k × k convolution into a depthwise convolution and a pointwise convolution.



Dynamic convolution has two main limitations: (a) lack of compactness, due to the use of K kernels, and (b) a challenging joint optimization of attention scores $\{\pi_k(x)\}$ and static kernels $\{W_k\}$. Yang et al. (2019) proposed the use of a sigmoid layer to generate attention scores $\{\pi_k(x)\}$, leading to a significantly larger space for the convolution kernel $W(x)$ that makes the learning of attention scores difficult. Chen et al. (2020c) replaced the sigmoid layer with a softmax function to compress the kernel space. However, small attention scores $\pi_k$ output by the softmax make the corresponding kernels $W_k$ difficult to learn, especially in early training epochs, slowing training convergence. To mitigate these issues, both methods require additional constraints. For instance, Chen et al. (2020c) uses a large temperature in the softmax function to encourage near-uniform attention.
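The effect of the softmax temperature can be seen in a few lines. This NumPy illustration uses made-up attention logits and an illustrative temperature value (not taken from the cited papers): dividing the logits by a large temperature τ flattens the attention scores toward the uniform value 1/K, so no kernel is starved of gradient early in training:

```python
import numpy as np

def softmax_with_temperature(logits, tau):
    z = logits / tau          # large tau flattens the distribution
    z = z - z.max()           # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5, 3.0])  # illustrative logits, K = 4
sharp = softmax_with_temperature(logits, tau=1.0)
near_uniform = softmax_with_temperature(logits, tau=30.0)
```

With τ = 1 the largest score dominates (about 0.68 here), while with τ = 30 all four scores stay close to 0.25, illustrating the near-uniform attention that eases joint optimization.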

ShuffleNet (Zhang et al., 2018b; Ma et al., 2018) uses group convolution and channel shuffle to further simplify pointwise convolution. Further improvements of these architectures have been investigated recently. EfficientNet (Tan & Le, 2019a; Tan et al., 2020) finds a proper relationship between input resolution and the width/depth of the network. Tan & Le (2019b) mixes up multiple kernel sizes in a single convolution. Chen et al. (2020a) trades massive multiplications for much cheaper additions. Han et al. (2020) applies a series of cheap linear transformations to generate ghost feature maps. Zhou et al. (2020) flips the structure of inverted residual blocks to alleviate information loss. Yu et al. (2019) and Cai et al. (2019) train one network that supports multiple sub-networks of different complexities.

