FILTRA: RETHINKING STEERABLE CNN BY FILTER TRANSFORM

Abstract

Steerable CNNs impose prior knowledge of transformation invariance or equivariance on the network architecture to enhance the network's robustness to geometric transformations of the data and to reduce overfitting. Filter transform has been an intuitive and widely used technique for constructing steerable CNNs over the past decades. Recently, group representation theory has been used to analyze steerable CNNs, revealing the structure of the function space of a steerable kernel. However, it is not yet clear how this theory relates to the filter transform technique. In this paper, we show that kernels constructed by filter transform can also be interpreted in group representation theory. Moreover, we show that filter-transformed kernels can be used to convolve input/output features in different group representations. This interpretation helps complete the puzzle of steerable CNN theory and provides a novel and simple approach to implement steerable convolution operators. Experiments on multiple datasets verify the feasibility of the proposed approach.

1. INTRODUCTION

Beyond the well-known property of equivariance under translation, there has been substantial recent interest in CNN architectures that are equivariant with respect to other transformation groups, e.g. reflection and rotation. Applications of such architectures cover scenarios where object orientation may vary, including OCR, aerial image processing, 3D point cloud processing, medical image processing, texture analysis, etc. Previous works on constructing equivariant CNNs can be coarsely divided into two categories. The first category designs special steerable filters so that the convolutional output is hard-baked to transform accordingly when the input reflects or rotates. Many works develop this idea by filter rotation, including hand-crafted filters (Oyallon & Mallat, 2015) and learned filters (Laptev et al., 2016; Zhou et al., 2017; Cheng et al., 2018; Marcos et al., 2017). TI-Pooling (Laptev et al., 2016) produces invariant output as the input rotates. ORN (Zhou et al., 2017) and RotDCF (Cheng et al., 2018) produce output which is circularly shifted as the input rotates. Since each dimension of such permutable output corresponds to a uniformly discretized rotation angle, RotEqNet (Marcos et al., 2017) proposes to extract the rotation angle from the permutable features. Another approach to constructing steerable filters is to linearly combine a set of steerable bases. These bases can be solved in a discrete function space (Cohen & Welling, 2014; 2016) or a continuous function space (Worrall et al., 2017; Weiler & Cesa, 2019). Weiler & Cesa (2019) comprehensively summarize works on steerable bases using the polar Fourier basis. The second category exploits specific transforms acting on the input. The Spatial Transformer Network (STN) is a well-known representative, which predicts an affine matrix to transform its input to a canonical form. Tai et al. (2019) inherit this idea to design an equivariant network.
Another choice of transform is the polar coordinate transform (Henriques & Vedaldi, 2017; Esteves et al., 2018). Since 2D rotation in the Cartesian coordinate system corresponds to 2D translation in the polar coordinate system, rotation equivariance can be achieved by a conventional translation-equivariant CNN. The approach proposed in this paper falls into the first category. Weiler & Cesa (2019) prove that every steerable convolution operator can be expressed as a combination of a specific set of polar Fourier bases. However, it is not yet clear how this interpretation relates to the widely used filter transform scheme. In this paper, we aim to establish the missing connection between the group-representation-based analysis of steerable filters and the filter transform scheme. To this end, we propose a new approach (FILTRA) which uses filter transform to establish steerability between features in different group representations of the cyclic group C_N and the dihedral group D_N. We verify the feasibility of FILTRA for classification and regression tasks on different datasets.

2. PRELIMINARIES

We make use of several NumPy and SciPy functions in equations, including roll¹, flipud² and circulant³. We sometimes omit the argument in brackets, writing κ_{**} = κ_{**}(g) and K_{**} = K_{**}(φ). We recapitulate the basic concepts of steerable CNNs which will be frequently used in this paper; for a comprehensive introduction, readers can refer to Weiler & Cesa (2019). We mainly consider the 2D image case and denote x ∈ R² as a pixel coordinate. We use a vector field f(x) ∈ R^C to denote a general multi-channel image, where C is the number of channels. Typical examples of f(x) include an RGB image f(x) ∈ R³ and a gradient image f(x) ∈ R². Consider a group G of transformations and an element g ∈ G. Examples of G include rotations, translations and flips. A vector field f(x) transforms as follows under the action π(g) of a group element g:

2.1. STEERABLE CNN

$$\pi(g) \cdot f(x) = \rho(g)\, f(g^{-1}x), \qquad (1)$$
where ρ(g) is a group representation associated with the vector field f. Fig. 1 shows examples of different types of ρ for RGB images and gradient images under a rotation element g. The group representation of an RGB image is ρ(g) ≡ 1, while for a gradient image ρ(g) is a 2D rotation matrix which also rotates the vector f(x) by g. In the context of convolutional neural networks, a convolution operator f → κ • f is considered steerable if it satisfies
$$\kappa \bullet [\pi_1(g) f] = \pi_2(g)\, [\kappa \bullet f], \qquad (2)$$
i.e. the output vector field transforms equivariantly under g when the input is transformed by g.
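A single generic filter does not satisfy (2) on its own, but a classical identity shows that rotating the input can be traded for rotating the filter, which is exactly what the filter-transform constructions of later sections exploit. The sketch below is our own NumPy/SciPy sanity check of this identity for a 90° rotation, not code from the paper:

```python
import numpy as np
from scipy.signal import convolve2d

# Rotating both the input and the filter by 90 degrees rotates the
# convolution output by 90 degrees: conv(R f, R kappa) = R conv(f, kappa).
rng = np.random.default_rng(0)
f = rng.standard_normal((6, 6))      # single-channel input patch
kappa = rng.standard_normal((3, 3))  # single-channel filter

lhs = convolve2d(np.rot90(f), np.rot90(kappa), mode="valid")
rhs = np.rot90(convolve2d(f, kappa, mode="valid"))
assert np.allclose(lhs, rhs)
```

Because a generic κ is not invariant under rotation, achieving (2) with a fixed filter bank requires the bank to be closed under the group action, which is the motivation for stacking rotated filter copies.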

2.2. REFLECTION GROUP, CYCLIC GROUP AND DIHEDRAL GROUP

We consider steerable filters for the reflection group ({±1}, ∗), the cyclic group C_N and the dihedral group D_N = ({±1}, ∗) ⋉ C_N. To unify notation in the derivation, we interpret C_N = ({1}, ∗) ⋉ C_N and ({±1}, ∗) = ({±1}, ∗) ⋉ C_1 = D_1, so that an element of any of these three groups can be denoted as a pair g = (i_0, i_1), whose range is Z_2 × Z_1 for the reflection group, Z_1 × Z_N for the cyclic group and Z_2 × Z_N for the dihedral group. Each element of C_N corresponds to a rotation angle θ_{i_1} = 2 i_1 π / N.

2.3. GROUP REPRESENTATION

A linear representation ρ of a group G on a vector space R^n is a group homomorphism from G to the general linear group GL(n):
$$\rho: G \to GL(n) \quad \text{s.t.} \quad \rho(g g') = \rho(g)\,\rho(g'), \quad \forall g, g' \in G. \qquad (3)$$
We consider three types of linear representations in this paper: the trivial representation, the regular representation and the irreducible representation (irrep). Readers can refer to Serre (1977) for further background on these concepts. The trivial representation of a group element is always ρ_tri(g) ≡ 1. The regular representation of a finite group G acts on the vector space R^{|G|} by permuting its axes. For a rotation element g = (0, i_1) ∈ C_N or D_N, we get
$$\rho^{C_N}_{reg}(g) = P(i_1), \qquad \rho^{D_N}_{reg}(g) = \begin{bmatrix} P(i_1) & 0 \\ 0 & P(i_1) \end{bmatrix}, \quad \text{where } P(i_1) = \mathrm{roll}(I_N, i_1, 0). \qquad (4)$$
For a reflected element g = (1, i_1) ∈ D_N, we get
$$\rho^{D_N}_{reg}(g) = \begin{bmatrix} 0 & B(i_1) \\ B(i_1) & 0 \end{bmatrix}, \quad \text{where } B(i_1) = \mathrm{flipud}(P(-i_1 - 1)). \qquad (5)$$
By a suitable change of basis of the vector space, a representation can be converted to an equivalent representation which is the direct sum of several independent representations on orthogonal subspaces. A representation is called irreducible if no non-trivial decomposition exists. This conversion is denoted as
$$\rho(g) = Q \Big( \bigoplus_{i \in I} \psi_i(g) \Big) Q^{-1}, \qquad (6)$$
where I is an index set specifying the irreducible representations ψ_i and Q is the change of basis.
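The permutation matrices above can be written down directly with NumPy. The sketch below is our own illustration using the roll/flipud conventions stated above; it builds the regular representations of C_N and D_N and checks the homomorphism property as well as the involutive nature of reflections:

```python
import numpy as np

N = 6

def P(i1):
    # Cyclic shift matrix: regular representation of a rotation in C_N.
    return np.roll(np.eye(N), i1, axis=0)

def B(i1):
    # Reflection block used by the D_N regular representation.
    return np.flipud(P(-i1 - 1))

def rho_DN(i0, i1):
    # Regular representation of g = (i0, i1) in D_N (a 2N x 2N matrix).
    Z = np.zeros((N, N))
    if i0 == 0:
        return np.block([[P(i1), Z], [Z, P(i1)]])
    return np.block([[Z, B(i1)], [B(i1), Z]])

# Homomorphism check for rotations: P(a) P(b) = P((a + b) mod N).
for a in range(N):
    for b in range(N):
        assert np.allclose(P(a) @ P(b), P((a + b) % N))

# A reflection composed with itself is the identity element.
for i1 in range(N):
    assert np.allclose(rho_DN(1, i1) @ rho_DN(1, i1), np.eye(2 * N))
```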

2.4. DECOMPOSING REGULAR REPRESENTATION

We decompose the regular representation into a set of irreps. Define the base irreps
$$\psi_{j,k}(i_0, i_1) = \begin{cases} \big((-1)^j\big)^{i_0} & k = 0, \\[2pt] (-1)^{i_1} \cdot \big((-1)^j\big)^{i_0} & k = \tfrac{N}{2},\ N \text{ even}, \\[2pt] \begin{bmatrix} \cos k\theta_{i_1} & -\sin k\theta_{i_1} \\ \sin k\theta_{i_1} & \cos k\theta_{i_1} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & (-1)^{i_0} \end{bmatrix} \big((-1)^j\big)^{i_0} & \text{otherwise}, \end{cases} \qquad (7)$$
where j and k are referred to as the reflection and rotation frequency of the irrep. Concretely, if the action g reflects/rotates an object once, ψ_{j,k}(g) reflects/rotates in the vector space j/k times. We also define the discrete cosine transform basis V = [β_0^⊤ β_1^⊤ ⋯ β_{N/2}^⊤], where each block β_k (of one or two rows, suitably normalized so that V is orthogonal) is
$$\beta_k = \begin{cases} \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix} & k = 0, \\[2pt] \begin{bmatrix} \cos k\theta_0 & \cos k\theta_1 & \cdots & \cos k\theta_{N-1} \end{bmatrix} & k = \tfrac{N}{2},\ N \text{ even}, \\[2pt] \begin{bmatrix} \cos k\theta_0 & \cos k\theta_1 & \cdots & \cos k\theta_{N-1} \\ \sin k\theta_0 & \sin k\theta_1 & \cdots & \sin k\theta_{N-1} \end{bmatrix} & \text{otherwise}. \end{cases} \qquad (8)$$
The following decomposition of ρ^{C_N}_{reg}(0, i_1) holds:
$$\rho^{C_N}_{reg}(g) = V D^{C_N} V^\top, \qquad D^{C_N} = \bigoplus_{0 \le k \le N/2} \psi_{0,k}(0, i_1). \qquad (9)$$
The decomposition of ρ^{D_N}_{reg}(i_0, i_1) holds in a slightly more complicated form:
$$\rho^{D_N}_{reg}(i_0, i_1) = W D^{D_N} W^\top, \quad W = \begin{bmatrix} V & V \\ V & -V \end{bmatrix}, \quad D^{D_N} = \bigoplus_{0 \le j \le 1,\, 0 \le k \le N/2} \psi_{j,k}(i_0, i_1), \qquad (10)$$
where each column of W is referred to as β_{j,k} = [β_k;\ (-1)^j β_k]. See Fig. 2 for a visualization of this decomposition. We also mention a property of β_k (treated as a 2 × N block) that is easy to verify and will be useful in our derivation:
$$\psi_{0,k}(0,i_1)\,\beta_k = \beta_k P(i_1), \quad \psi_{1,k}(0,i_1)\,\beta_k = \beta_k P(i_1), \quad \psi_{0,k}(1,i_1)\,\beta_k = \beta_k B(i_1), \quad \psi_{1,k}(1,i_1)\,\beta_k = -\beta_k B(i_1), \qquad (11)$$
where ψ_{0,k}(0, i_1) rotates the column vectors of β_k as if they were circularly shifted.

Lemma 1. The map f → κ • f is equivariant under G if and only if for all g ∈ G,
$$\kappa(gx) = \rho_{out}(g)\, \kappa(x)\, \rho_{in}(g)^{-1}. \qquad (12)$$
Weiler & Cesa (2019) prove that such filters can be expressed by a series of harmonic bases b(φ), i.e.
$$\kappa(r, \varphi) = \sum_{b \in K} \omega_b(r)\, b(\varphi), \qquad (13)$$
where ω_b(r) are the per-radius weights and K is a set of harmonic bases as derived in the appendix of Weiler & Cesa (2019).
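The property ψ_{0,k}(0, i_1) β_k = β_k P(i_1) stated above is easy to confirm numerically. The sketch below is our own verification, treating β_k as a 2 × N block of cos/sin samples:

```python
import numpy as np

N, k = 8, 2
theta = 2 * np.pi * np.arange(N) / N
# beta_k as a 2 x N block: rows of cos/sin samples at frequency k.
beta_k = np.vstack([np.cos(k * theta), np.sin(k * theta)])

def P(i1):
    # Cyclic shift matrix (regular representation of a rotation).
    return np.roll(np.eye(N), i1, axis=0)

def psi(i1):
    # Rotation irrep psi_{0,k}(0, i1): a 2D rotation by k * theta_{i1}.
    a = k * theta[i1]
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

# Applying the irrep to beta_k circularly shifts its columns.
for i1 in range(N):
    assert np.allclose(psi(i1) @ beta_k, beta_k @ P(i1))
```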
For example, considering ρ_in = ψ_{i,m} and ρ_out = ψ_{j,n} in D_N, the harmonic basis set is
$$K_{\psi_{j,n} \leftarrow \psi_{i,m}} = \left\{ b_{\mu,\gamma,s}(\varphi) = \psi(\mu\varphi)\,\xi(s) \;\middle|\; \mu = m - sn,\ s \in \{\pm 1\} \right\}. \qquad (14)$$

3. MAIN RESULTS

Eqs. (12) and (13) provide a general approach to verify and construct steerable CNNs with different representations. In this section, we relate these theories to filter transform and show how to use filter transform to construct steerable filters whose input/output features are in different representations. For readers who are not interested in group theory and the mathematical derivation of the theory connection, we highlight the key equations for constructing steerable filters in boxes. It should not be difficult to implement steerable filters directly from these equations using any modern deep learning framework. Fig. 3 illustrates these equations. In our derivation, we mainly consider the angular coordinate of the polar coordinate functions κ(r, φ) and write them as κ(φ). We will also frequently make use of the following property:
$$\kappa(\varphi - \theta_0) = \kappa(\varphi + \theta_0), \qquad \kappa(\varphi - \theta_i) = \kappa(\varphi + \theta_{N-i}). \qquad (15)$$

3.1. FROM TRIVIAL REPRESENTATION TO REGULAR REPRESENTATION

Rotation Group C_N. Consider the rotating filter K and its reflected version K̄, which are commonly used in previous works, e.g. TI-Pooling, ORN, RotEqNet and RotDCF:
$$K(\varphi) = \begin{bmatrix} \kappa_0 & \kappa_1 & \cdots & \kappa_{N-1} \end{bmatrix}^\top, \quad \kappa_n(\varphi) = \kappa(\varphi - \theta_n), \qquad (16a)$$
$$\bar K(\varphi) = \begin{bmatrix} \bar\kappa_0 & \bar\kappa_1 & \cdots & \bar\kappa_{N-1} \end{bmatrix}^\top, \quad \bar\kappa_n(\varphi) = \kappa(\theta_n - \varphi). \qquad (16b)$$
The output of a convolution with the above kernels naturally permutes as the input rotates in C_N. This intuitively corresponds to the property of a steerable filter transforming from the trivial representation to the regular representation. In this paper, we use K and K̄ as the basic filters to construct different types of steerable filters in C_N and D_N. We verify the above steerability by substituting K into the lhs of Lemma 1 with g = (0, 1):
$$K(\varphi + \theta_1) = \begin{bmatrix} \kappa(\varphi + \theta_1) & \kappa_0 & \cdots & \kappa_{N-2} \end{bmatrix}^\top = \begin{bmatrix} \kappa_{N-1} & \kappa_0 & \cdots & \kappa_{N-2} \end{bmatrix}^\top \qquad (17a)$$
$$= \rho^{C_N}_{reg}(0,1)\, K\, \rho_{tri}(0,1)^{-1}. \qquad (17b)$$
The above equation can be verified similarly for the other g = (0, i_1) and for K̄. Thus, WLOG we select the steerable filter which transforms the trivial representation to the regular representation on C_N as
$$\boxed{K^{C_N}_{0\to reg} = K.} \qquad (18)$$

Dihedral Group D_N. The steerable filter that transforms the trivial representation to the regular representation on D_N can be constructed as
$$\boxed{K^{D_N}_{0\to reg}(\varphi) = \begin{bmatrix} K \\ \bar K \end{bmatrix},} \qquad (19)$$
which corresponds to enumerating each D_N element acting on the kernel κ. For g = (0, i_1), K^{D_N}_{0→reg} can be verified to follow (12) in the same way as (17a), i.e. K^{D_N}_{0→reg}(φ + θ_{i_1}) = ρ^{D_N}_{reg}(g) K^{D_N}_{0→reg} ρ_{tri}(g)^{-1}. For the reflected action with g = (1, 1), we write
$$K(-\varphi + \theta_1) = \begin{bmatrix} \kappa(-\varphi + \theta_1) & \kappa(-\varphi - \theta_0) & \kappa(-\varphi - \theta_1) & \cdots & \kappa(-\varphi - \theta_{N-2}) \end{bmatrix}^\top = \begin{bmatrix} \bar\kappa_1 & \bar\kappa_0 & \bar\kappa_{N-1} & \cdots & \bar\kappa_2 \end{bmatrix}^\top = B(1)\bar K.$$
Similarly, we can show for g = (1, i_1) that
$$K(-\varphi + \theta_{i_1}) = B(i_1)\bar K, \qquad \bar K(-\varphi + \theta_{i_1}) = B(i_1) K. \qquad (20)$$
Thus we verify (12) for the reflected actions g = (1, i_1) by summarizing the above as
$$K^{D_N}_{0\to reg}(-\varphi + \theta_{i_1}) = \rho^{D_N}_{reg}(g)\, K^{D_N}_{0\to reg}\, \rho_{tri}(g)^{-1}. \qquad (21)$$
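For N = 4, rotations are exact pixel permutations, so the trivial-to-regular steerability (17a) can be checked on a discrete filter directly. The sketch below is our own illustration: it builds K as a stack of rotated copies of an arbitrary filter and confirms that rotating every copy once more equals circularly shifting the stack:

```python
import numpy as np

N = 4
kappa = np.arange(9.0).reshape(3, 3)  # an arbitrary 3x3 filter
# K: stack of N rotated copies; kappa_n is kappa rotated by theta_n.
K = np.stack([np.rot90(kappa, n) for n in range(N)])

# Evaluating K at phi + theta_1 rotates each copy once clockwise, which
# must equal the circular shift rho_reg(0, 1) K, as stated in (17a).
K_shifted = np.stack([np.rot90(f, -1) for f in K])
assert np.allclose(K_shifted, np.roll(K, 1, axis=0))
```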

3.2. FROM IRREP TO REGULAR REPRESENTATION

Rotation Group C_N. Consider a C_N irrep ψ_{0,k}(g) with frequency (0, k). We show that the kernel
$$\boxed{K^{C_N}_{k\to reg} = \mathrm{diag}(K)\, \beta_k^\top} \qquad (22)$$
transforms ψ_{0,k}(g) to the regular representation for the actions g = (0, i_1). The proof of correctness can be found in the appendix.

Dihedral Group D_N. Consider a D_N irrep ψ_{j,k}(i_0, i_1) with frequency (j, k). We show that the kernel
$$\boxed{K^{D_N}_{j,k\to reg} = \begin{bmatrix} K^{C_N}_{k\to reg} \\ (-1)^j \cdot \bar K^{C_N}_{k\to reg} \end{bmatrix}} \qquad (23)$$
transforms ψ_{j,k}(i_0, i_1) to the regular representation for the actions g = (i_0, i_1) ∈ D_N.
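In the continuous angular picture, the C_N construction above can be verified numerically. The sketch below is our own check with an arbitrary angular profile κ; it confirms K_{k→reg}(φ + θ_{i_1}) = ρ_reg(0, i_1) K_{k→reg}(φ) ψ_{0,k}(0, i_1)^{-1}:

```python
import numpy as np

N, k = 8, 2
theta = 2 * np.pi * np.arange(N) / N
kappa = lambda phi: np.cos(phi) + 0.5 * np.sin(3 * phi)  # arbitrary profile

def K_vec(phi):
    # N rotated copies of kappa: entry n is kappa(phi - theta_n).
    return kappa(phi - theta)

def beta_k():
    # beta_k^T as an N x 2 matrix of cos/sin samples at frequency k.
    return np.stack([np.cos(k * theta), np.sin(k * theta)], axis=1)

def K_k2reg(phi):
    # Irrep-to-regular kernel diag(K) beta_k^T: an N x 2 angular kernel.
    return np.diag(K_vec(phi)) @ beta_k()

def P(i1):
    return np.roll(np.eye(N), i1, axis=0)

def psi(i1):
    a = k * theta[i1]
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

phi0 = 0.3
for i1 in range(N):
    lhs = K_k2reg(phi0 + theta[i1])
    rhs = P(i1) @ K_k2reg(phi0) @ psi(i1).T  # psi^{-1} = psi^T (orthogonal)
    assert np.allclose(lhs, rhs)
```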

3.3. FROM REGULAR REPRESENTATION TO REGULAR REPRESENTATION

The regular representation possesses the nice property that it can be averaged, pooled or activated channel-wise without violating steerability (Weiler & Cesa, 2019). Thus it is convenient to use the regular representation for the intermediate features of a steerable CNN. We show in this subsection that the following kernels can be used to construct steerable kernels whose input and output features are both in the regular representation.

Rotation Group C_N.
$$\boxed{K^{C_N}_{reg\to reg} = \begin{bmatrix} K^{C_N}_{0\to reg} & \cdots & K^{C_N}_{\frac{N}{2}\to reg} \end{bmatrix} V^{-1}.} \qquad (24)$$

Dihedral Group D_N.
$$\boxed{K^{D_N}_{reg\to reg} = \begin{bmatrix} K^{D_N}_{0,0\to reg} & \cdots & K^{D_N}_{0,\frac{N}{2}\to reg} & K^{D_N}_{1,0\to reg} & \cdots & K^{D_N}_{1,\frac{N}{2}\to reg} \end{bmatrix} W^{-1}.} \qquad (25)$$
The above two kernels can be verified to transform the regular representation to the regular representation in a similar way; we show the derivation for the C_N case (24) as an example in the appendix.
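The C_N regular-to-regular construction above can likewise be verified numerically. The sketch below is our own check; the normalization of V is irrelevant here since V⁻¹ is applied explicitly. It assembles the kernel from the irrep-to-regular blocks and confirms K(φ + θ_{i_1}) = ρ_reg(0, i_1) K(φ) ρ_reg(0, i_1)⁻¹:

```python
import numpy as np

N = 8
theta = 2 * np.pi * np.arange(N) / N
kappa = lambda phi: np.cos(phi) + 0.5 * np.sin(3 * phi) + 0.2 * np.cos(2 * phi)

def K_vec(phi):
    return kappa(phi - theta)  # N rotated copies of kappa

def beta(k):
    # beta_k^T as an N x 1 or N x 2 block of the DCT basis.
    if k == 0:
        return np.ones((N, 1))
    if 2 * k == N:
        return np.cos(k * theta)[:, None]
    return np.stack([np.cos(k * theta), np.sin(k * theta)], axis=1)

V = np.hstack([beta(k) for k in range(N // 2 + 1)])  # N x N change of basis

def K_reg2reg(phi):
    # Concatenate the irrep-to-regular blocks and undo the change of basis.
    blocks = np.hstack([np.diag(K_vec(phi)) @ beta(k)
                        for k in range(N // 2 + 1)])
    return blocks @ np.linalg.inv(V)

def P(i1):
    return np.roll(np.eye(N), i1, axis=0)

phi0 = 0.7
for i1 in range(N):
    lhs = K_reg2reg(phi0 + theta[i1])
    rhs = P(i1) @ K_reg2reg(phi0) @ P(i1).T
    assert np.allclose(lhs, rhs)
```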

3.4. REVERSED TRANSFORM OF REPRESENTATIONS

It is easy to see from (12) that if ρ_in and ρ_out are orthogonal matrices, i.e. ρ_in^{-1} = ρ_in^⊤ and ρ_out^{-1} = ρ_out^⊤, the transpose of (12) directly proves the equivariance of κ^⊤ with the representation transform direction reversed, i.e. from ρ_out to ρ_in. Thus we can easily obtain equivariant kernels from the regular representation to the trivial/irreducible representations by simply transposing (18), (19), (22) and (23).

3.5. CONVENTIONAL ROTATING FILTERS

We have comprehensively studied the approach of using filter rotation to form steerable convolution kernels with regular-representation features as input or output. Conventional filter-rotation-based networks exploit some of the basic forms introduced in this section. TI-Pooling (Laptev et al., 2016) exploits the kernel K^{C_N}_{0→reg} to transform the trivial to the regular representation, then executes orientation pooling to convert the regular back to the trivial representation, losing orientation information. RotDCF and ORN exploit a kernel of the form
$$K^{C_N}_{ORN} = \mathrm{circulant}(K). \qquad (26)$$
It is easy to verify that K^{C_N}_{ORN} also follows Lemma 1 and is a steerable filter. However, compared to K^{C_N}_{reg→reg}, K^{C_N}_{ORN} consumes the same filter storage but has less weight capacity (N vs. N⌈N/2⌉). RotEqNet constructs a 2D vector field which rotates as its input rotates, but regards the 2D vector field as independent trivial representations in convolution. As shown in this paper, steerability is better preserved by regarding the vector field as an irrep representation with frequency 1.

3.6. NUMERICAL ACCURACY FOR DISCRETE KERNELS

Note that when implementing discrete convolution, the equality (17a) does not hold perfectly. For example, consider κ_n(φ) = κ(φ − θ_n); the equality κ_n(θ_n) = κ(0) holds for a continuous κ. However, for a discrete κ, κ_n(φ) is a rotated interpolation of κ(φ) and this equality does not hold precisely in general. There exist some exceptions where the equality can be achieved for a discrete κ. One example is when κ_n(φ) is a 90° rotation of κ, so it can be constructed precisely from κ. Another example is when κ_n is a 45° rotation interpolated by nearest pixel from a κ of size 3 × 3.

3.7. STEERABLE CNN WITH MULTIPLE LAYERS

A conventional CNN is usually composed of convolution, pooling, nonlinearity and fully-connected layers. To achieve equivariance for the overall network, it is desired that all the component layers are steerable. As analyzed in the appendix of Weiler & Cesa (2019), channel-wise nonlinearity and channel-wise pooling preserve steerability on feature maps with the regular representation. Fully-connected layers are a special case of convolution with 1 × 1 kernels and can thus be easily realized by steerable convolution.
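The interpolation caveat of Subsect. 3.6 can be illustrated numerically. The sketch below is our own demonstration using scipy.ndimage.rotate (not the paper's implementation): quarter-turn rotations map the pixel grid onto itself and are lossless, while a generic angle loses information to interpolation:

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)
kappa = rng.standard_normal((7, 7))

# Quarter turns map the pixel grid onto itself: they compose losslessly.
quarter = rotate(kappa, 90, reshape=False, order=1)
restored = rotate(quarter, 270, reshape=False, order=1)
assert np.allclose(restored, kappa)

# A generic angle requires interpolation: rotating by 30 degrees and back
# does not recover the filter exactly.
back = rotate(rotate(kappa, 30, reshape=False, order=1),
              -30, reshape=False, order=1)
err = np.abs(back - kappa).max()
assert err > 1e-6
```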

4. EXPERIMENTS

The proposed equivariant convolution, referred to as FILTRA, can be interpreted as an alternative formulation to the harmonics-based (Weiler & Cesa, 2019) implementation of steerable convolution. In this section we show the pros and cons of each implementation by experiments. We make use of the framework E2CNN (Weiler & Cesa, 2019) for our experiments, as it provides a general interface and operations for steerable CNNs. Experiments are executed on the MNIST, KMNIST (Clanuwat et al., 2018), FashionMNIST (Xiao et al., 2017), EMNIST (Cohen et al., 2017) and CIFAR10 datasets. We compare FILTRA against two convolution operations, i.e. the representative harmonics-based convolution R2Conv (Weiler & Cesa, 2019) from E2CNN and the conventional vanilla convolution. All MNIST-like datasets are experimented on the same feature extraction backbone as described in Table 1a, with the convolution operator realized by the three experimented approaches. CIFAR10 is experimented with WideResNet (Zagoruyko & Komodakis, 2016) in a setting similar to Weiler & Cesa (2019). We found that on CIFAR10, a C_4 steerable network performs better than C_8 for both approaches. For all experiments, we randomly rotate or reflect the input according to the experiment settings. The settings and evaluation results are listed in Table 2. Different from Weiler et al. (2018), we force the three convolution kernels to output the same number of channels. For example, compared to vanilla convolution, the number of free weights for a C_8 FILTRA is reduced to 1/8 and for D_8 is reduced to 1/16. The filters for all the approaches thus have exactly the same shape at the deployment stage. Experiments are executed on a GTX 2070. The training procedure of FILTRA and R2Conv can both be implemented as a vanilla convolution plus a filter generation step. For the C_8 case the runtime of both generators is similar, and for the D_8 case FILTRA is slightly faster. We show the runtime of the D_8 case at the training stage in Table 1a.
R2Conv additionally requires an initialization of about 2 min. Both approaches consume the same inference time as vanilla convolution.

4.1. CLASSIFICATION TASK

The most typical experiment in previous works on steerable CNNs is the classification task. We follow this convention and compare the classification performance of the three experimented approaches in Table 2. FILTRA shows comparable performance to R2Conv and slightly improves accuracy on OCR-like (*MNIST) tasks where high-frequency texture is limited. On CIFAR10, the performance of FILTRA is slightly worse. The explanation lies in the interpolation artifacts mentioned in Subsect. 3.6: since the interpolation of high-frequency components deviates more, it harms the performance on CIFAR10, which contains more high-frequency texture.

4.2. REGRESSION TASK

Besides the typical classification task, we find that the property of steerability is naturally advantageous for many regression tasks whose input might rotate or reflect. In this paper, we evaluate the regression performance with an example task of predicting the character direction. Similar tasks are commonly used in OCR techniques. When the character rotates, the predicted direction should rotate with the same rotation frequency. This means the predicted 2D direction vector follows the irrep ψ_{0,1} of C_N. We reuse the backbone in Table 1a to extract features and use the regression head in Table 1c to predict a unit 2D vector denoting the direction. The network is trained with an MSE loss. Note that the images should be masked by a disk to prevent the network from overfitting to the direction of the rotated black boundary. The different approaches are evaluated by the mean included angle between the predicted and ground-truth directions, as shown in Table 2. FILTRA with C_8 steerability performs best when trained on data augmented over SO(2). We attribute this to the fact that the FILTRA weights are naturally organized in the discrete grid layout: each element of the discrete weight matrix contributes one more DoF to the filters. In contrast, R2Conv uses filters parameterized in polar coordinates, and the DoF of the filters is slightly reduced due to the discretization.

5. CONCLUSIONS

In this paper, we establish the connection between the recent steerable CNN structures based on group representation theory and the conventional transformed filters. To this end, we propose an approach to construct steerable convolution filters which transform between features in trivial, irreducible and regular representations. We verify the feasibility of FILTRA for classification and regression tasks on several datasets.



¹ https://numpy.org/doc/stable/reference/generated/numpy.roll.html
² https://numpy.org/doc/stable/reference/generated/numpy.flipud.html
³ https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.circulant.html



Figure 1: Examples of images (feature maps) with different group representations ρ. Both images undergo a 90° rotation. The upper row is an RGB image whose 3-channel colors remain the same when the image is rotated. The lower row is a gradient image whose two-channel vector values should be rotated in the same way when the image is rotated.

Figure 2: Illustrations of (10) for g = (0, 1) (left) and g = (1, 1) (right). Red, light yellow and green denote negative, zero and positive values, respectively.

Figure 3: Visualization of FILTRA filter examples. Based on the same weight kernel K, we generate the filters K^{C_N}_{0→reg}, K^{D_N}_{0→reg}, K^{C_N}_{k→reg} and K^{D_N}_{j,k→reg}. In this example we set j = 1, k = 1, N = 8. The two columns of matrix β_k are split into β_k^0 and β_k^1 for visualization. Red, light yellow and green denote negative, zero and positive values, respectively. Please view this figure in color.

Table 1: Network structures used in our experiments. The backbone is composed of convolution, ReLU and pooling layers. The convolution layers are realized by FILTRA, R2Conv and conventional convolution respectively, while the remaining layers are the same. The three realizations have the same number of output channels in each layer, but organize the channels to follow the regular representation for FILTRA and R2Conv. k: kernel size. s: stride. δt: filter generation time in ms.

