REVISITING DYNAMIC CONVOLUTION VIA MATRIX DECOMPOSITION

Abstract

Recent research in dynamic convolution shows substantial performance boosts for efficient CNNs, due to the adaptive aggregation of K static convolution kernels. It has two limitations: (a) it increases the number of convolutional weights by a factor of K, and (b) the joint optimization of the dynamic attention and the static convolution kernels is challenging. In this paper, we revisit dynamic convolution from the new perspective of matrix decomposition and reveal that the key issue is that dynamic convolution applies dynamic attention over channel groups after projecting into a higher dimensional latent space. To address this issue, we propose dynamic channel fusion to replace dynamic attention over channel groups. Dynamic channel fusion not only enables a significant dimension reduction of the latent space, but also mitigates the joint optimization difficulty. As a result, our method is easier to train and requires significantly fewer parameters without sacrificing accuracy.

1. INTRODUCTION

Dynamic convolution (Yang et al., 2019; Chen et al., 2020c) has recently become popular for the implementation of light-weight networks (Howard et al., 2017; Zhang et al., 2018b). Its ability to achieve significant performance gains with negligible computational cost has motivated its adoption for multiple vision tasks (Su et al., 2020; Chen et al., 2020b; Ma et al., 2020; Tian et al., 2020). The basic idea is to aggregate multiple convolution kernels dynamically, according to an input-dependent attention mechanism, into a convolution weight matrix

W(x) = Σ_{k=1}^{K} π_k(x) W_k,   s.t. 0 ≤ π_k(x) ≤ 1, Σ_{k=1}^{K} π_k(x) = 1,   (1)

where the K convolution kernels {W_k} are aggregated linearly with attention scores {π_k(x)}. Dynamic convolution has two main limitations: (a) lack of compactness, due to the use of K kernels, and (b) a challenging joint optimization of the attention scores {π_k(x)} and the static kernels {W_k}. Yang et al. (2019) proposed the use of a sigmoid layer to generate the attention scores, leading to a significantly large space for the convolution kernel W(x) that makes the learning of the attention scores difficult. Chen et al. (2020c) replaced the sigmoid layer with a softmax function to compress the kernel space. However, small attention scores π_k output by the softmax make the corresponding kernels W_k difficult to learn, especially in early training epochs, slowing training convergence. To mitigate these limitations, both methods require additional constraints. For instance, Chen et al. (2020c) uses a large temperature in the softmax function to encourage near-uniform attention.

In this work, we revisit the two limitations via matrix decomposition. To expose them, we reformulate dynamic convolution in terms of a set of residuals, re-defining the static kernels as

W_k = W_0 + ΔW_k,   k ∈ {1, . . . , K},   (2)

where W_0 = (1/K) Σ_{k=1}^{K} W_k is the average kernel and ΔW_k = W_k − W_0 a residual weight matrix. Further decomposing the latter with an SVD, ΔW_k = U_k S_k V_k^T, leads to

W(x) = Σ_{k=1}^{K} π_k(x) W_0 + Σ_{k=1}^{K} π_k(x) U_k S_k V_k^T = W_0 + U Π(x) S V^T,   (3)

where U = [U_1, . . . , U_K], S = diag(S_1, . . . , S_K), V = [V_1, . . . , V_K], and Π(x) stacks the attention scores diagonally as Π(x) = diag(π_1(x)I, . . . , π_K(x)I), with I an identity matrix. This decomposition, illustrated in Figure 1, shows that the dynamic behavior of W(x) is implemented by the dynamic residual U Π(x) S V^T, which projects the input x to a higher dimensional space S V^T x (from C to KC channels), applies dynamic attention Π(x) over channel groups, and reduces the dimension back to C channels through multiplication by U. This suggests that the limitations of vanilla dynamic convolution are due to the use of attention over channel groups, which induces a high dimensional latent space, leading to small attention values that may suppress the learning of the corresponding channels.

To address this issue, we propose a dynamic convolution decomposition (DCD) that replaces dynamic attention over channel groups with dynamic channel fusion. The latter is based on a full dynamic matrix Φ(x), of which each element φ_{i,j}(x) is a function of the input x. As shown in Figure 1 (right), the dynamic residual is implemented as the product P Φ(x) Q^T of Φ(x) and two static matrices P, Q, such that Q compresses the input into a low dimensional latent space, Φ(x) dynamically fuses the channels in this space, and P expands the number of channels to the output space.
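The reformulation above can be checked numerically. The following numpy sketch (ours, not the authors' code; kernel values and attention scores are random placeholders) verifies that W_0 + U Π(x) S V^T reproduces the aggregated kernel of Eq. 1 exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 6, 4
Wk = rng.standard_normal((K, C, C))        # K static 1x1-conv kernels W_k
pi = rng.random(K); pi /= pi.sum()         # attention scores, summing to 1

W0 = Wk.mean(axis=0)                       # average kernel W_0
# SVD of each residual dW_k = W_k - W_0 = U_k S_k V_k^T
U_blocks, S_blocks, Vt_blocks = zip(*[np.linalg.svd(Wk[k] - W0) for k in range(K)])

U = np.hstack(U_blocks)                    # C x KC,  U = [U_1, ..., U_K]
S = np.diag(np.concatenate(S_blocks))      # KC x KC, block-diagonal singular values
Vt = np.vstack(Vt_blocks)                  # KC x C,  stacked V_k^T
Pi = np.kron(np.diag(pi), np.eye(C))       # Pi(x) = diag(pi_1 I, ..., pi_K I)

W_aggregated = np.einsum('k,kij->ij', pi, Wk)  # Eq. 1: sum_k pi_k(x) W_k
W_decomposed = W0 + U @ Pi @ S @ Vt            # Eq. 3: W_0 + U Pi(x) S V^T
```

Note how the latent space S V^T x has KC channels, illustrating the dimensionality blow-up the paper identifies.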
The key innovation is that dynamic channel fusion with Φ(x) enables a significant dimensionality reduction of the latent space (Q^T x ∈ R^L, L ≪ C). Hence the number of parameters in P, Q is significantly reduced when compared to U, V of Eq. 3, resulting in a more compact model. Dynamic channel fusion also mitigates the joint optimization challenge of vanilla dynamic convolution, as each column of P, Q is associated with multiple dynamic coefficients of Φ(x). Hence, a few dynamic coefficients of small value are not sufficient to suppress the learning of the static matrices P, Q. Experimental results show that DCD both significantly reduces the number of parameters and achieves higher accuracy than vanilla dynamic convolution, without requiring the additional constraints of Yang et al. (2019) or Chen et al. (2020c).

2. RELATED WORK

Efficient CNNs: MobileNet (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019) decomposes k × k convolution into a depthwise and a pointwise convolution. ShuffleNet (Zhang et al., 2018b; Ma et al., 2018) uses group convolution and channel shuffle to further simplify pointwise convolution. Further improvements of these architectures have been investigated recently. EfficientNet (Tan & Le, 2019a; Tan et al., 2020) finds a proper relationship between the input resolution and the width/depth of the network. Tan & Le (2019b) mixes multiple kernel sizes in a single convolution. Chen et al. (2020a) trades massive multiplications for much cheaper additions. Han et al. (2020) applies a series of cheap linear transformations to generate ghost feature maps. Zhou et al. (2020) flips the structure of inverted residual blocks to alleviate information loss. Yu et al. (2019) and Cai et al. (2019) train one network that supports multiple sub-networks of different complexities.

Matrix Decomposition: Lebedev et al. (2014) and Denton et al. (2014) use Canonical Polyadic decomposition (CPD) of convolution kernels to speed up networks, while Kim et al. (2015) investigates Tucker decompositions for the same purpose. More recently, Kossaifi et al. (2020) combines tensor decompositions with MobileNet to design efficient higher-order networks for video tasks, while Phan et al. (2020) proposes a stable CPD to deal with degeneracies of tensor decompositions during network training. Unlike DCD, which decomposes a convolutional kernel dynamically by adapting the core matrix to the input, these works all rely on static decompositions.

Dynamic Neural Networks: Dynamic networks boost representation power by adapting parameters or activation functions to the input. Ha et al. (2017) uses a secondary network to generate parameters for the main network. Hu et al. (2018) reweights channels by squeezing global context. Li et al. (2019) adapts attention over kernels of different sizes.
Dynamic convolution (Yang et al., 2019; Chen et al., 2020c) aggregates multiple convolution kernels based on attention. Ma et al. (2020) uses a grouped fully connected layer to generate convolutional weights directly. Chen et al. (2020b) extends dynamic convolution from spatially agnostic to spatially specific. Su et al. (2020) proposes dynamic group convolution, which adaptively selects input channels to form groups. Tian et al. (2020) applies dynamic convolution to instance segmentation. Chen et al. (2020d) adapts the slopes and intercepts of the two linear functions in ReLU (Nair & Hinton, 2010; Jarrett et al., 2009).

3. DYNAMIC CONVOLUTION DECOMPOSITION

In this section, we introduce the dynamic convolution decomposition proposed to address the limitations of vanilla dynamic convolution. For conciseness, we assume a kernel W with the same number of input and output channels (C in = C out = C) and ignore bias terms. We focus on 1 × 1 convolution in this section and generalize the procedure to k×k convolution in the following section.

3.1. REVISITING VANILLA DYNAMIC CONVOLUTION

Vanilla dynamic convolution aggregates K convolution kernels {W_k} with attention scores {π_k(x)} (see Eq. 1). It can be reformulated as the addition of a dynamic residual to a static kernel, and the dynamic residual can be further decomposed by SVD (see Eq. 3), as shown in Figure 1. This reveals two limitations. First, the model is not compact. Essentially, it expands the number of channels by a factor of K and applies dynamic attention over K channel groups. The dynamic residual U Π(x) S V^T is a C × C matrix, of maximum rank C, but it sums KC rank-1 matrices, since

W(x) = W_0 + U Π(x) S V^T = W_0 + Σ_{i=1}^{KC} π_{⌈i/C⌉}(x) u_i s_{i,i} v_i^T,   (4)

where u_i is the i-th column vector of U, v_i is the i-th column vector of V, s_{i,i} is the i-th diagonal entry of S, and ⌈·⌉ is the ceiling operator. The static basis vectors u_i and v_i are not shared across the rank-1 matrices π_{⌈i/C⌉}(x) u_i s_{i,i} v_i^T. This results in model redundancy. Second, it is difficult to jointly optimize the static matrices U, V and the dynamic attention Π(x), because a small attention score π_{⌈i/C⌉}(x) may suppress the learning of the corresponding columns u_i, v_i of U and V, especially in early training epochs (as shown in Chen et al. (2020c)).
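The rank-1 expansion of Eq. 4 can be sketched directly. In this numpy check (ours; U, V, S, and π are random stand-ins), the block form U Π(x) S V^T and the explicit sum of KC rank-1 terms with the ceiling-indexed attention scores agree:

```python
import numpy as np
from math import ceil

rng = np.random.default_rng(1)
C, K = 4, 3
U = rng.standard_normal((C, K * C))          # columns u_i
V = rng.standard_normal((C, K * C))          # columns v_i
s = rng.standard_normal(K * C)               # diagonal entries s_{i,i} of S
pi = rng.random(K); pi /= pi.sum()           # attention scores pi_k(x)

# block form: U Pi(x) S V^T, with Pi(x) = diag(pi_1 I, ..., pi_K I)
Pi = np.kron(np.diag(pi), np.eye(C))
residual_matrix = U @ Pi @ np.diag(s) @ V.T

# rank-1 expansion: sum_{i=1}^{KC} pi_{ceil(i/C)} u_i s_{i,i} v_i^T
residual_sum = np.zeros((C, C))
for i in range(1, K * C + 1):                # 1-indexed to match the ceiling operator
    residual_sum += pi[ceil(i / C) - 1] * s[i - 1] * np.outer(U[:, i - 1], V[:, i - 1])
```

The loop makes the redundancy explicit: each of the KC rank-1 terms owns its own static basis vectors u_i, v_i.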

3.2. DYNAMIC CHANNEL FUSION

We propose to address the limitations of vanilla dynamic convolution with a dynamic channel fusion mechanism, implemented with a full matrix Φ(x), each element φ_{i,j}(x) of which is a function of the input x. Φ(x) is an L × L matrix that dynamically fuses channels in the latent space R^L. The key idea is to significantly reduce the dimensionality of the latent space (L ≪ C) to enable a more compact model. Dynamic convolution is implemented with dynamic channel fusion as

W(x) = W_0 + P Φ(x) Q^T = W_0 + Σ_{i=1}^{L} Σ_{j=1}^{L} φ_{i,j}(x) p_i q_j^T,   (5)

where Q ∈ R^{C×L} compresses the input into a low dimensional space (Q^T x ∈ R^L), the resulting L channels are fused dynamically by Φ(x) ∈ R^{L×L} and expanded to the number of output channels by P ∈ R^{C×L}. We denote this as the dynamic convolution decomposition (DCD). The dimension L of the latent space is constrained by L² < C. Its default value in this paper is set empirically by dividing C by 2 repeatedly until the result is less than √C. With this new design, the number of static parameters is significantly reduced (i.e. LC parameters in P or Q vs. KC² parameters in U or V, with L < √C), resulting in a more compact model.

Mathematically, the dynamic residual P Φ(x) Q^T sums L² rank-1 matrices φ_{i,j}(x) p_i q_j^T, where p_i is the i-th column vector of P and q_j is the j-th column vector of Q. The constraint L² < C guarantees that this number (L²) is much smaller than its counterpart (KC) in vanilla dynamic convolution (see Eq. 4). Nevertheless, due to the use of a full matrix, dynamic channel fusion Φ(x) retains the representation power needed to achieve good classification performance. DCD also mitigates the joint optimization difficulty: since each column of P (or Q) is associated with multiple dynamic coefficients (e.g. p_i is related to φ_{i,1}, . . . , φ_{i,L}), it is unlikely that the learning of p_i is suppressed by a few dynamic coefficients of small value.
In summary, DCD performs dynamic aggregation differently from vanilla dynamic convolution. Vanilla dynamic convolution uses a shared dynamic attention mechanism to aggregate unshared static basis vectors in a high dimensional latent space. In contrast, DCD uses an unshared dynamic channel fusion mechanism to aggregate shared static basis vectors in a low dimensional latent space.
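The DCD residual of Eq. 5 can be sketched as follows (a minimal numpy illustration, not the authors' implementation; Φ is a random stand-in for the input-dependent fusion matrix). It also shows that the weight matrix W(x) never needs to be formed explicitly, since the fusion can be applied in the latent space, and compares static parameter counts:

```python
import numpy as np

rng = np.random.default_rng(2)
C, L, K = 64, 4, 4                    # L satisfies L**2 < C; K as in vanilla dynamic conv
W0  = rng.standard_normal((C, C))     # static kernel
P   = rng.standard_normal((C, L))     # expands the L latent channels back to C
Q   = rng.standard_normal((C, L))     # compresses the C input channels to L
Phi = rng.standard_normal((L, L))     # stand-in for the dynamic fusion Phi(x)
x   = rng.standard_normal(C)

W = W0 + P @ Phi @ Q.T                # Eq. 5: W(x) = W_0 + P Phi(x) Q^T
y = W @ x
y_latent = W0 @ x + P @ (Phi @ (Q.T @ x))   # same result, fusing in the latent space

# static parameters of the residual: 2CL for P, Q vs 2KC^2 for U, V in Eq. 3
dcd_params, vanilla_params = 2 * C * L, 2 * K * C * C
```

With C = 64, L = 4, K = 4 the residual needs 512 static parameters instead of 32,768, matching the compactness argument above.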

3.3. MORE GENERAL FORMULATION

So far, we have focused on the dynamic residual and shown that dynamic channel fusion enables a compact implementation of dynamic convolution. We next discuss the static kernel W_0. Originally, it is multiplied by a dynamic scalar Σ_k π_k(x), which is canceled in Eq. 3 because the attention scores sum to one. Relaxing the constraint Σ_k π_k(x) = 1 results in the more general form

W(x) = Λ(x) W_0 + P Φ(x) Q^T,   (6)

where Λ(x) is a C × C diagonal matrix whose diagonal entries λ_{i,i}(x) are functions of the input x. In this way, Λ(x) implements channel-wise attention after the static kernel W_0, generalizing Eq. 5, where Λ(x) is an identity matrix. We will see later that this generalization enables additional performance gains.

Relation to Squeeze-and-Excitation (SE) (Hu et al., 2018): The dynamic channel-wise attention mechanism implemented by Λ(x) is related to, but different from, SE. It is parallel to a convolution and shares its input with the convolution. It can be thought of as either a dynamic convolution kernel, y = (Λ(x)W_0)x, or an input-dependent attention mechanism applied to the output feature map of the convolution, y = Λ(x)(W_0 x). Thus, its computational complexity is min(O(C²), O(HWC)), where H and W are the height and width of the feature map. In contrast, SE is placed after a convolution and uses the output of the convolution as input. It can only apply channel attention on the output feature map of the convolution, as y = Λ(z)z with z = W_0 x, at computational complexity O(HWC). Clearly, SE requires more computation than the dynamic channel-wise attention Λ(x) when the resolution H × W of the feature map is high.
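The two readings of Λ(x), as part of the kernel or as attention on the output features, are algebraically identical, which is what allows picking the cheaper one. A short numpy check (ours; random values, a flattened H·W spatial axis):

```python
import numpy as np

rng = np.random.default_rng(3)
C, HW = 16, 49                          # channels, spatial positions (H*W)
W0  = rng.standard_normal((C, C))       # static kernel
lam = rng.random(C)                     # diagonal of channel-wise attention Lambda(x)
X   = rng.standard_normal((C, HW))      # feature map flattened over space

Y_kernel  = (np.diag(lam) @ W0) @ X     # fold Lambda into the kernel: O(C^2) extra work
Y_feature = lam[:, None] * (W0 @ X)     # attend on output features: O(HWC) extra work
```

Since the two outputs match, an implementation can choose per layer whichever of O(C²) and O(HWC) is smaller, exactly the min(·, ·) complexity stated above.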

3.4. DYNAMIC CONVOLUTION DECOMPOSITION LAYER

Implementation: Figure 2 shows the diagram of a dynamic convolution decomposition (DCD) layer. It uses a light-weight dynamic branch to generate the coefficients of both the dynamic channel-wise attention Λ(x) and the dynamic channel fusion Φ(x). Similar to Squeeze-and-Excitation (Hu et al., 2018), the dynamic branch first applies average pooling to the input x. This is followed by two fully connected (FC) layers with an activation layer between them. The first FC layer reduces the number of channels by a factor of r and the second expands them into C + L² outputs (C for Λ(x) and L² for Φ(x)). The convolutional weights W(x) are finally generated with Eq. 6. Like a static convolution, a DCD layer also includes a batch normalization and an activation (e.g. ReLU) layer.

Parameter Complexity: DCD has similar FLOPs to vanilla dynamic convolution, so we focus on parameter complexity. Static convolution and vanilla dynamic convolution require C² and KC² parameters, respectively. DCD requires C², CL, and CL parameters for the static matrices W_0, P, and Q, respectively. The dynamic branch requires an additional (2C + L²)C/r parameters to generate Λ(x) and Φ(x), where r is the reduction rate of the first FC layer. The total complexity is therefore C² + 2CL + (2C + L²)C/r. Since L is constrained by L² < C, the complexity upper bound is (1 + 3/r)C² + 2C√C. When choosing r = 16, the complexity is about (1 + 3/16)C², much less than that of vanilla dynamic convolution (4C² in Chen et al. (2020c) and 8C² in Yang et al. (2019)).
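A functional sketch of one DCD layer, assuming the structure just described (our naming, e.g. `dcd_forward`; bias terms, batch norm, and the final activation are omitted for brevity, and the branch weights are random stand-ins for learned parameters):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dcd_forward(X, W0, P, Q, Wfc1, Wfc2):
    """One DCD 1x1-conv layer on a single image (sketch of Figure 2).

    X: (C, H, W) input feature map.
    W0: (C, C) static kernel; P, Q: (C, L) static projections.
    Wfc1: (C//r, C) and Wfc2: (C + L*L, C//r) are the dynamic-branch FC weights.
    """
    C, H, Wd = X.shape
    L = P.shape[1]
    pooled = X.reshape(C, -1).mean(axis=1)            # global average pooling
    z = relu(Wfc1 @ pooled)                           # first FC + activation
    coeffs = Wfc2 @ z                                 # C + L^2 dynamic coefficients
    lam = coeffs[:C]                                  # channel-wise attention Lambda(x)
    Phi = coeffs[C:].reshape(L, L)                    # dynamic channel fusion Phi(x)
    Wx = np.diag(lam) @ W0 + P @ Phi @ Q.T            # Eq. 6
    return (Wx @ X.reshape(C, -1)).reshape(C, H, Wd)  # 1x1 convolution

rng = np.random.default_rng(4)
C, L, r, H, Wd = 16, 4, 4, 5, 5
X = rng.standard_normal((C, H, Wd))
Y = dcd_forward(X, rng.standard_normal((C, C)),
                rng.standard_normal((C, L)), rng.standard_normal((C, L)),
                rng.standard_normal((C // r, C)),
                rng.standard_normal((C + L * L, C // r)))
```

Note the dynamic branch outputs C + L² = 32 coefficients here, matching the split into Λ(x) and Φ(x) described above.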

4. EXTENSIONS OF DYNAMIC CONVOLUTION DECOMPOSITION

In this section, we extend the dynamic decomposition of 1 × 1 convolution (Eq. 6) in three ways: (a) sparse dynamic residual where P Φ(x)Q T is a diagonal block matrix, (b) k × k depthwise convolution, and (c) k × k convolution. Here, k refers to the kernel size.

4.1. DCD WITH SPARSE DYNAMIC RESIDUAL

The dynamic residual P Φ(x) Q^T can be further simplified into a block-diagonal matrix with blocks P_b Φ_b(x) Q_b^T, b ∈ {1, . . . , B}, leading to

W(x) = Λ(x) W_0 + ⊕_{b=1}^{B} P_b Φ_b(x) Q_b^T,   (7)

where ⊕_{i=1}^{n} A_i = diag(A_1, . . . , A_n). This form has Eq. 6 as the special case B = 1. Note that the static kernel W_0 is still a full matrix and only the dynamic residual is sparse (see Figure 3). We will show later that keeping as few as 1/8 of the entries of the dynamic residual non-zero (B = 8) causes minimal performance degradation, still significantly outperforming a static kernel.
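The block-diagonal residual of Eq. 7 can be assembled as follows (a numpy sketch with our helper name `sparse_dynamic_residual`; blocks are random placeholders). Only the B diagonal blocks are non-zero, i.e. a 1/B fraction of the C × C entries:

```python
import numpy as np

def sparse_dynamic_residual(Ps, Phis, Qs):
    """Assemble the block-diagonal residual diag_b(P_b Phi_b Q_b^T) of Eq. 7."""
    blocks = [P @ Phi @ Q.T for P, Phi, Q in zip(Ps, Phis, Qs)]
    n = sum(b.shape[0] for b in blocks)
    R = np.zeros((n, n))
    i = 0
    for b in blocks:                    # place each block on the diagonal
        c = b.shape[0]
        R[i:i + c, i:i + c] = b
        i += c
    return R

rng = np.random.default_rng(5)
C, B, L = 32, 4, 2
Cb = C // B                             # each block covers C/B channels
Ps   = [rng.standard_normal((Cb, L)) for _ in range(B)]
Qs   = [rng.standard_normal((Cb, L)) for _ in range(B)]
Phis = [rng.standard_normal((L, L)) for _ in range(B)]
R = sparse_dynamic_residual(Ps, Phis, Qs)
```

Each block needs its own small P_b, Q_b, so increasing B shrinks the static parameter count of the residual while the static kernel W_0 stays full.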

4.2. DCD OF k × k DEPTHWISE CONVOLUTION

The weights of a k × k depthwise convolution kernel form a C × k² matrix. DCD can be generalized to such matrices by replacing, in Eq. 6, the matrix Q (which squeezes the number of channels) with a matrix R (which squeezes the number of kernel elements):

W(x) = Λ(x) W_0 + P Φ(x) R^T,   (8)

where W(x) and W_0 are C × k² matrices, Λ(x) is a diagonal C × C matrix that implements channel-wise attention, R is a k² × L_k matrix that reduces the number of kernel elements from k² to L_k, Φ(x) is an L_k × L_k matrix that performs dynamic fusion over the L_k latent kernel elements, and P is a C × L_k weight matrix for depthwise convolution over the L_k kernel elements. The default value of L_k is ⌈k²/2⌉. Since depthwise convolution is channel separable, Φ(x) does not fuse channels; it fuses the L_k latent kernel elements instead.


4.3. DCD OF k × k CONVOLUTION

The weights of a k × k convolution form a C × C × k² tensor. DCD can be generalized to such tensors by extending Eq. 6 into a tensor form (see Figure 4):

W(x) = W_0 ×_2 Λ(x) + Φ(x) ×_1 Q ×_2 P ×_3 R,   (9)

where ×_n refers to n-mode multiplication (Lathauwer et al., 2000), W_0 is a C × C × k² tensor, Λ(x) is a diagonal C × C matrix that implements channel-wise attention, Q is a C × L matrix that reduces the number of input channels from C to L, R is a k² × L_k matrix that reduces the number of kernel elements from k² to L_k, Φ(x) is an L × L × L_k tensor that performs joint fusion of the L latent channels over the L_k latent kernel elements, and P is a C × L matrix that expands the number of channels from L back to C. The numbers of latent channels L and latent kernel elements L_k are constrained by L_k < k² and L²L_k ≤ C. Their default values are set empirically to L_k = ⌈k²/2⌉ and to L obtained by applying the rule of Section 3.2 to C/L_k (dividing by 2 repeatedly until the result is less than √(C/L_k)).

Channel fusion alone: We found that the fusion of channels, Φ(x) ×_1 Q, is more important than the fusion of kernel elements, Φ(x) ×_3 R. Therefore, we reduce L_k to 1 and increase L accordingly. R is then simplified into a one-hot vector [0, . . . , 0, 1, 0, . . . , 0]^T, where the '1' is located at the kernel center (assuming that k is odd). As illustrated in Figure 4-(b), the dynamic residual tensor Φ(x) ×_1 Q ×_2 P ×_3 R then has only one non-zero slice, which is equivalent to a 1 × 1 convolution. Therefore, the DCD of a k × k convolution essentially adds a 1 × 1 dynamic residual to a static k × k kernel.
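The channel-fusion-alone case can be verified with mode-n products (a numpy sketch; `mode_n_product` is our helper, and all tensors are random stand-ins). With L_k = 1 and a one-hot R, the residual tensor has a single non-zero kernel slice, equal to a 1 × 1 convolution matrix:

```python
import numpy as np

def mode_n_product(T, A, n):
    """n-mode product T x_n A: contracts mode n of T with the columns of A."""
    return np.moveaxis(np.tensordot(A, T, axes=(1, n)), 0, n)

rng = np.random.default_rng(6)
C, L, k = 16, 4, 3
Lk = 1                                         # channel fusion alone
Phi = rng.standard_normal((L, L, Lk))          # dynamic fusion tensor Phi(x)
Q = rng.standard_normal((C, L))                # channel compression
P = rng.standard_normal((C, L))                # channel expansion
R = np.zeros((k * k, Lk))
R[(k * k) // 2, 0] = 1.0                       # one-hot at the kernel center

# residual = Phi x_1 Q x_2 P x_3 R  (modes written 0-based below)
residual = mode_n_product(mode_n_product(mode_n_product(Phi, Q, 0), P, 1), R, 2)
```

Only `residual[:, :, 4]` (the center of the 3 × 3 kernel) is non-zero, and it equals Q Φ P^T, i.e. a 1 × 1 dynamic residual added to the static k × k kernel.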

5. EXPERIMENTS

In this section, we present the results of DCD on ImageNet classification (Deng et al., 2009). ImageNet has 1,000 classes, with 1,281,167 training and 50,000 validation images. We also report ablation studies on the different components of the approach. All experiments are based on two network architectures: ResNet (He et al., 2016) and MobileNetV2 (Sandler et al., 2018). DCD is implemented on all convolutional layers of ResNet and all 1 × 1 convolutional layers of MobileNetV2. The reduction ratio r is set to 16 for ResNet and MobileNetV2 ×1.0, and to 8 for the smaller models (MobileNetV2 ×0.5 and ×0.35). All models are trained by SGD with momentum 0.9 and batch size 256; the remaining training parameters are as follows. ResNet: The learning rate starts at 0.1 and is divided by 10 every 30 epochs. The model is trained for 100 epochs. Dropout (Srivastava et al., 2014) with rate 0.1 is used only for ResNet-50.

MobileNetV2:

The initial learning rate is 0.05 and decays to 0 in 300 epochs, according to a cosine function. Weight decay of 2e-5 and a dropout rate of 0.1 are also used. For MobileNetV2 ×1.0, Mixup (Zhang et al., 2018a) and label smoothing are further added to avoid overfitting. 

5.1. INSPECTING DIFFERENT DCD FORMULATIONS

Table 1 summarizes the influence of the different components of DCD (dynamic channel fusion Φ(x) and dynamic channel-wise attention Λ(x)) on MobileNetV2 ×0.5 and ResNet-18 performance. The table shows that both dynamic components of Eq. 6, Λ(x) and Φ(x), enhance accuracy substantially (+2.8% and +3.8% for MobileNetV2 ×0.5, +1.1% and +2.4% for ResNet-18), when compared to the static baseline. Using dynamic channel fusion alone (W_0 + P Φ(x)Q^T) has slightly more parameters, FLOPs, and accuracy than using dynamic channel-wise attention alone (Λ(x)W_0). The combination of the two mechanisms provides an additional improvement.

5.2. ABLATIONS

A number of ablations were performed on MobileNetV2 ×0.5 to analyze DCD performance in terms of two questions: (1) How does the dimension L of the latent space affect performance? (2) How do the three DCD variants perform? The default configuration is the general form of DCD (Eq. 6) with a full size dynamic residual (B = 1) for all pointwise convolution layers, and the default latent space dimension of Section 3.2.

Latent Space Dimension L: The dynamic channel fusion matrix Φ(x) has size L × L. Thus, L controls both the representation and the parameter complexity of DCD. We adjust it by applying different multipliers to its default value. Table 2 shows the results of MobileNetV2 ×0.5 for four multiplier values ranging from ×1.0 to ×0.25. As L decreases, fewer parameters are required and the performance degrades slowly. Even with a very low dimensional latent space (L × 0.25), DCD still outperforms the static baseline by 3.3% top-1 accuracy.
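The default rule for L, as described in Section 3.2, can be written as a short helper (our implementation of the stated rule, "divide C by 2 repeatedly until the result is less than √C"; the paper may round differently in edge cases):

```python
import math

def default_latent_dim(C):
    """Default latent dimension: halve C until it drops below sqrt(C)."""
    L = C
    while L >= math.sqrt(C):
        L //= 2
    return L
```

For example, a layer with C = 64 channels gets L = 4 (64 → 32 → 16 → 8 → 4), so the fusion matrix Φ(x) is only 4 × 4, and the constraint L² < C holds by construction.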

Number of Diagonal Blocks B in the Dynamic Residual: Table 3-(a) shows classification results for four values of B. The dynamic residual is a full matrix when B = 1, while only 1/8 of its entries are non-zero for B = 8. Accuracy degrades slowly as the dynamic residual becomes sparser (increasing B). The largest performance drop happens when B is changed from 1 to 2, as half of the weight matrix W(x) becomes static. However, performance is still significantly better than that of the static baseline. The fact that even the sparsest setting, B = 8, outperforms the static baseline by 2.9% (from 65.4% to 68.3%) demonstrates the representation power of the dynamic residual. In all cases, dynamic channel-wise attention Λ(x) enables additional performance gains.

DCD at Different Layers: Table 3-(b) shows the results of implementing DCD for three different types of layers: (a) DW: depthwise convolution (Eq. 8), (b) PW: pointwise convolution (Eq. 6), and (c) CLS: the fully connected classifier, which is a special case of pointwise convolution (the input resolution is 1 × 1). Using DCD in any type of layer improves on the performance of the static baseline (+2.9% for depthwise convolution, +4.4% for pointwise convolution, and +1.2% for the classifier). Combining DCD for both pointwise convolution and the classifier achieves the best performance (+4.8%). We notice a performance drop (from 70.2% to 70.0%) when using DCD in all three types of layers. We believe this is due to overfitting, as that configuration has higher training accuracy.

Extension to 3 × 3 Convolution: We use ResNet-18, which stacks 16 layers of 3 × 3 convolution, to study the 3 × 3 extension of DCD (see Section 4.3). Compared to the static baseline (70.4% top-1 accuracy), DCD with joint fusion of channels and kernel elements (Eq. 9) improves top-1 accuracy by 0.9% (to 71.3%). Top-1 accuracy is further improved by 1.8% (to 73.1%) when using DCD with channel fusion alone, which reduces the dynamic residual to a 1 × 1 convolution matrix (see Figure 4-(b)).
This demonstrates that dynamic fusion is more effective across channels than across kernel elements. Summary: Based on the ablations above, DCD should be implemented with both dynamic channel fusion Φ and dynamic channel-wise attention Λ, the default latent space dimension L, and a full size residual B = 1. DCD is recommended for pointwise convolution and classifier layers in MobileNetV2. For 3 × 3 convolutions in ResNet, DCD should be implemented with channel fusion alone. The model can be made more compact, for a slight performance drop, by (a) removing dynamic channel-wise attention Λ, (b) reducing the latent space dimension L, (c) using a sparser dynamic residual (increasing B), and (d) implementing DCD in depthwise convolution alone.

5.3. MAIN RESULTS

DCD was compared to vanilla dynamic convolution (Yang et al., 2019; Chen et al., 2020c) on MobileNetV2 and ResNet, using the settings recommended above, with the results shown in Table 4. DCD significantly reduces the number of parameters while improving the performance of both network architectures. For MobileNetV2 ×1.0, DCD only requires 50% of the parameters of (Chen et al., 2020c) and 25% of the parameters of (Yang et al., 2019). For ResNet-18, it only requires 33% of the parameters of (Chen et al., 2020c), while achieving a 0.4% gain in top-1 accuracy. Although DCD requires slightly more MAdds than (Chen et al., 2020c), the increment is negligible. These results demonstrate that DCD is more compact and effective.

5.4. ANALYSIS OF DCD

To validate the dynamic property, Φ(x) should take different values for different images. We measure this by averaging the variance of each entry, σ_Φ = Σ_{i,j} σ_{i,j}/L², where σ_{i,j} is the variance of φ_{i,j}(x) over all validation images. To compare σ_Φ across layers, we normalize it by the variance of the corresponding input feature map. Figure 6 shows the normalized variance σ_Φ across layers in MobileNetV2. Clearly, the dynamic coefficients vary more in the higher layers. We believe this is because the higher layers encode more context information, providing more clues to adapt the convolution weights.
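The variance statistic above can be sketched as follows (our helper name `normalized_phi_variance`; the exact normalization used in the paper may differ in detail, and the inputs here are synthetic):

```python
import numpy as np

def normalized_phi_variance(phis, feature_maps):
    """sigma_Phi: mean entry-wise variance of phi_ij(x) across images,
    normalized by the variance of the layer's input feature maps.

    phis: (N, L, L) fusion matrices collected over N validation images.
    feature_maps: (N, ...) input features of the same layer for those images.
    """
    sigma_phi = phis.var(axis=0).mean()     # average variance over the L*L entries
    return sigma_phi / feature_maps.var()   # normalize for cross-layer comparison

rng = np.random.default_rng(7)
# synthetic example: a layer with nearly constant coefficients vs a highly dynamic one
low  = normalized_phi_variance(0.01 * rng.standard_normal((100, 4, 4)),
                               rng.standard_normal((100, 16, 8, 8)))
high = normalized_phi_variance(1.00 * rng.standard_normal((100, 4, 4)),
                               rng.standard_normal((100, 16, 8, 8)))
```

A layer whose Φ(x) barely changes across images yields a small σ_Φ, while a strongly input-dependent layer yields a large one, which is the behavior Figure 6 plots across network depth.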

5.5. INFERENCE TIME

We use a single thread of an AMD EPYC 7551P CPU (2.0 GHz) to measure the running time (in milliseconds) of MobileNetV2 ×0.5 and ×1.0. Running time is calculated by averaging the inference time of 5,000 images with batch size 1. Both the static baseline and DCD are implemented in PyTorch. Compared with the static baseline, DCD consumes about 8% more MAdds (97.0M vs 104.8M) and 14% more running time (91ms vs 104ms) for MobileNetV2 ×0.5. For MobileNetV2 ×1.0, DCD consumes 9% more MAdds (300.0M vs 326.0M) and 12% more running time (146ms vs 163ms). The overhead is higher in running time than in MAdds. We believe this is because global average pooling and fully connected layers are not optimized as efficiently as convolution. This small penalty in inference time is justified by the DCD gains of 4.8% and 3.2% top-1 accuracy over the MobileNetV2 ×0.5 and ×1.0 baselines, respectively.

6. CONCLUSION

In this paper, we have revisited dynamic convolution via matrix decomposition and demonstrated the limitations of dynamic attention over channel groups: it multiplies the number of parameters by K and increases the difficulty of joint optimization. We proposed a dynamic convolution decomposition to address these issues. This applies dynamic channel fusion to significantly reduce the dimensionality of the latent space, resulting in a more compact model that is easier to learn with often improved accuracy. We hope that our work provides a deeper understanding of the gains recently observed for dynamic convolution.



Footnote to Table 4: The baseline results are from the original papers. Our implementation, under the setup used for DCD, gives similar or slightly lower results; e.g., for MobileNetV2 ×1.0 the original paper reports 72.0%, while our implementation achieves 71.8%.



Figure 1: Dynamic convolution via matrix decomposition. Left: reformulation of vanilla dynamic convolution by matrix decomposition (see Eq. 3), which applies dynamic attention Π(x) over channel groups in a high dimensional space (S V^T x ∈ R^{KC}). Right: the proposed dynamic convolution decomposition, which applies dynamic channel fusion Φ(x) in a low dimensional space (Q^T x ∈ R^L, L ≪ C), resulting in a more compact model.

Figure 2: Dynamic convolution decomposition layer. The input x first goes through a dynamic branch to generate Λ(x) and Φ(x), which are then used to generate the convolution matrix W(x) via Eq. 6.

Figure 3: Sparse dynamic residual, represented as a block-diagonal matrix. Each diagonal block is decomposed separately as P_b Φ_b(x) Q_b^T. Note that the static kernel W_0 is still a full size matrix.

Figure 4: The dynamic convolution decomposition for k × k convolution. (a) Φ(x) fuses both channels and kernel elements. (b) Φ(x) fuses channels alone.


Figure 5: The comparison of training and validation error between DCD and DY-Conv on MobileNetV2 ×0.5. τ is the temperature in softmax. Best viewed in color.

Figure 5 compares DCD to DY-Conv (Chen et al., 2020c) in terms of training convergence. DY-Conv uses a large temperature in its softmax to alleviate the joint optimization difficulty and make training more efficient. Without any additional parameter tuning, DCD converges faster than DY-Conv with a large temperature and achieves higher accuracy.


Table 1: Different formulations of dynamic convolution decomposition on ImageNet classification.

Table 2: Dimension L of the latent space, evaluated on ImageNet classification (MobileNetV2 ×0.5).

Table 3: Extensions of dynamic convolution decomposition (DCD), evaluated on ImageNet classification (MobileNetV2 ×0.5).

Table 4: Comparison of DCD with the vanilla dynamic convolutions CondConv (Yang et al., 2019) and DY-Conv (Chen et al., 2020c). CondConv contains K = 8 kernels and DY-Conv contains K = 4 kernels. The marked entry indicates the dynamic model with the fewest parameters (the static model is not included).

