COMPACT BILINEAR POOLING VIA GENERAL BILINEAR PROJECTION

Abstract

Many factorized bilinear pooling (FBiP) algorithms employ a Hadamard product-based bilinear projection to learn appropriate projecting directions that reduce the dimension of bilinear features. However, in this paper, we reveal that the Hadamard product-based bilinear projection makes FBiP miss many possible projecting directions, which significantly harms the compactness and effectiveness of the resulting compact bilinear features. To address this issue, we propose a general matrix-based bilinear projection based on the rank-k matrix base decomposition, of which the Hadamard product-based bilinear projection and $Y = U^T X V$ are special cases. Thus, our proposed projection can be used to improve algorithms based on either of these two types of bilinear projection. Using the proposed bilinear projection, we design a novel low-rank factorized bilinear pooling (named RK-FBP), which considers the feasible projecting directions missed by the Hadamard product-based bilinear projection. Thus, our RK-FBP can generate better compact bilinear features. To leverage high-order information in local features, we nest several RK-FBP modules together to formulate a multi-linear pooling that outputs compact multi-linear features. Finally, we conduct experiments on several fine-grained image tasks to evaluate our models, which show that our models achieve new state-of-the-art classification accuracy with the lowest dimension.

1. INTRODUCTION

Bilinear pooling (BiP) (Lin, 2015) and its variants (Li et al., 2017b; Lin & Maji, 2017; Wang et al., 2017) employ the Kronecker product to yield expressive representations by mining the rich statistical information from a set of local features, and have attracted wide attention in many applications, such as fine-grained image classification and visual question answering. Although they achieve excellent performance, bilinear features suffer from two shortcomings: (1) the ability of BiP to boost the discriminant information between different classes also magnifies the intra-class variances of representations, which makes BiP prone to the burstiness problem (Gao et al., 2020; Zheng et al., 2019) and causes a performance deficit; (2) the Kronecker product exploited by BiP usually makes the bilinear features exceptionally high-dimensional, which leads to overfitting in the succeeding tasks and a heavy computational load due to increased memory storage. Thus, how to effectively overcome the shortcomings of BiP is an important issue.

Several approaches have been proposed (Gao et al., 2016; Fukui et al., 2016; Yu et al., 2021; Li et al., 2017b; Kim et al., 2016) to address these shortcomings. Among them, the factorized bilinear pooling (FBiP) methods (Li et al., 2017b; Kim et al., 2016; Amin et al., 2020; Yu et al., 2018; Gao et al., 2020) are promising. In essence, FBiP performs a dimension reduction operation on bilinear features. It finds a linear projection that maps bilinear features into a low-dimensional space while preserving their discriminant information among classes with the fewest dimensions, and then employs $L_2$-normalization to project those low-dimensional features onto a hyper-sphere. (Bilinear pooling equals the non-linear projection determined by the polynomial kernel function $k(x, y) = \langle x, y \rangle^2$ (Gao et al., 2016), which makes bilinear features likely to be linearly discriminable.)
Thus, the information reflecting the large intra-class variances is discarded because it does not help distinguish different classes. In this way, the burstiness and high-dimension problems are solved simultaneously (Gao et al., 2020; Wei et al., 2018). This procedure is depicted in Figure 1. Sub-figure (a) shows a set of samples with a large variance. Sub-figure (c) is the result after the linear projection and $L_2$-normalization, where the intra-class variance is reduced significantly. In fact, the information along $e_1$ can be completely discarded, and the dimension of the data becomes 1. Comparing sub-figure (c) with (e), we find that FBiP outperforms the combination of the signed-square-root transformation and $L_2$-normalization. To achieve such good performance, the key step is to find the accurate projecting direction $e_2$; otherwise the dimension reduction and the intra-class variance reduction cannot succeed.

[Figure 1: Comparing (e) with (f), the intra-class variances in regions 2 and 4 are reduced, but those in regions 1 and 3 are increased in (e). Thus, an accurate linear projection is better than the "signed-square-root" strategy for solving the burstiness problem.]

Because of the extremely high dimension of bilinear features, it is not easy to find appropriate projecting directions by traditional linear projection. Thus, finding a parameter-efficient model that accurately depicts those appropriate directions is crucial for overcoming the shortcomings of bilinear features. Most FBiP approaches formulate the projection for dimension reduction as a Hadamard product-based bilinear projection (Kim et al., 2016; Gao et al., 2020; Li et al., 2017b):

[Figure 2: panels (a) and (b), listing the products $(u_i^T x_s)(v_j^T y_t)$ for $i, j = 1, \dots, 5$.]

$$f = P^T (U^T x_s \circ V^T y_t)$$

where $\circ$ denotes the Hadamard product, $U$ and $V$ are learnable matrices, and $P$ is a matrix whose definition varies across algorithms.
Each dimension of $f$ can be seen as a linear combination of the dimensions of the Hadamard product $z = (U^T x_s \circ V^T y_t)$. Consider each dimension of $z$ shown in Figure 2. The Hadamard product only considers the values $\{(u_i^T x_s)(y_t^T v_j) \mid i = j\}$ and ignores the values $\{(u_i^T x_s)(y_t^T v_j) \mid i \neq j\}$, which are considered by the Kronecker product. In this paper, we prove that those ignored values are important for dimension reduction, because they are the coefficients of bilinear features on the feasible matrix projecting directions $\{v_i u_j^T \mid i \neq j\}$. Thus, missing them leads to inaccurate projecting directions for the dimension reduction, which inevitably hinders overcoming the shortcomings of BiP. Consequently, the effectiveness and compactness of FBiP are seriously harmed.

In this paper, from the perspective of finding accurate and parameter-efficient matrix projecting directions, we analyze the decomposition of the bases of a matrix space, and then propose a general bilinear projection based on decomposed rank-k matrix bases. Because of the solid mathematical foundation of the proposed bilinear projection, it can serve as a baseline for analyzing current FBiP methods. Employing our novel bilinear projection, we formulate a new FBiP model that does not miss possible projecting directions. The contributions are listed as follows:

(1) We make a detailed analysis to demonstrate why traditional FBiP tends to miss many feasible projecting directions. Based on our analysis, we propose a general bilinear projection that calculates the coefficients of matrix data on a complete set of decomposed rank-k matrix bases.

(2) Based on the proposed general bilinear projection, we design a new FBiP method named rank-k factorized bilinear pooling (RK-FBP). Because of its capability to learn accurate projecting directions, the calculated bilinear features are highly compact and effective. Utilizing this property, we nest several RK-FBP modules to calculate multi-linear features that are still compact and effective.

(3) We conduct experiments on several challenging image classification datasets to demonstrate the effectiveness of the proposed RK-FBP. Compared with state-of-the-art BiP methods, our model can output extremely compact bilinear features (the dimension is 512) and achieve comparable or better classification accuracy.

Notation. Throughout this paper, $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. $\mathrm{vec}(\cdot)$ is the matrix vectorization (i.e., reshaping a matrix to a vector by stacking its columns on top of one another). $\mathrm{rank}(\cdot)$ represents the rank of a matrix. The $ij$-th element of the matrix $A \in \mathbb{R}^{m \times n}$ is represented by $(A)_{ij}$, and the $i$-th element of $x \in \mathbb{R}^d$ is represented by $(x)_i$.
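The parenthetical remark in the introduction, that bilinear pooling corresponds to the polynomial kernel $k(x, y) = \langle x, y \rangle^2$, can be checked numerically with the $\mathrm{vec}(\cdot)$ and $\mathrm{Tr}(\cdot)$ notation above. The following is a minimal sketch, assuming NumPy and random test vectors; the dimension `m` and all variable names are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6                                   # illustrative feature dimension (assumption)
x = rng.standard_normal(m)
y = rng.standard_normal(m)


def vec(A):
    """Column-stacking vectorization, as defined in the Notation paragraph."""
    return A.reshape(-1, order="F")


# Bilinear (second-order) features of two single local descriptors.
phi_x = vec(np.outer(x, x))
phi_y = vec(np.outer(y, y))

inner_of_bilinear = phi_x @ phi_y       # <vec(x x^T), vec(y y^T)>
poly_kernel = (x @ y) ** 2              # k(x, y) = <x, y>^2
trace_form = np.trace(np.outer(x, x) @ np.outer(y, y).T)  # Tr(A B^T) inner product
```

All three quantities agree up to floating-point error, which is why a linear classifier on bilinear features behaves like a second-order polynomial kernel machine on the local features.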

2. PRELIMINARY

2.1. DIMENSION REDUCTION ON BILINEAR FEATURES

Given a set of training samples, each instance yields two groups of local features denoted by $X_f = \{x_s \in \mathbb{R}^m\}_{s=1}^{p}$ and $Y_f = \{y_t \in \mathbb{R}^n\}_{t=1}^{q}$ (in some cases, $Y_f = X_f$ (Lin, 2015; Kong & Fowlkes, 2017)). Bilinear pooling (BiP) integrates those two groups of local features into an expressive representation by the following operation:

$$X = \sum_{(s,t) \in S} x_s y_t^T \in \mathbb{R}^{m \times n} \quad (1)$$

where $X$ is the bilinear feature and $S$ is the pair set of local features. For convenience, we write the bilinear feature as $X = x_s y_t^T$ in the following content by ignoring the summation symbol. Because most deep neural networks satisfy $|S| < \min\{m, n\}$, the bilinear feature $X$ is a low-rank matrix. This property helps to design parameter-efficient dimension reduction algorithms.

Factorized bilinear pooling (FBiP) reduces $X$ to an $h$-dimensional vector by the linear projection $f = [\mathrm{Tr}(X W_1^T), \cdots, \mathrm{Tr}(X W_h^T)]$, where $W_r \in \mathbb{R}^{m \times n}$ is the $r$-th low-rank matrix projecting direction. For parameter efficiency, each $W_r$ is decomposed into small matrices in various ways, which leads to different FBiP algorithms. For example, decompose $W_r$ as $W_r = U_r V_r^T$, where $U_r \in \mathbb{R}^{m \times k}$, $V_r \in \mathbb{R}^{n \times k}$, and $k$ is the rank of $W_r$. The $r$-th dimension of $f$ can then be rewritten as follows:

$$(f)_r = \mathrm{Tr}(x_s y_t^T V_r U_r^T) = \mathbf{1}^T (U_r^T x_s \circ V_r^T y_t) \quad (2)$$

where $\circ$ is the Hadamard product and $\mathbf{1}$ is a vector with all elements equal to 1. Eq. (2) is adopted by FBC (Gao et al., 2020) and LowFER (Amin et al., 2020). Besides, some FBiP algorithms (Kong & Fowlkes, 2017; Wei et al., 2018) let the local feature sets satisfy $X_f = Y_f$ and assume $W_r$ to be a symmetric low-rank matrix. $W_r$ is decomposed as $W_r = U_r^+ (U_r^+)^T - U_r^- (U_r^-)^T$, where $U_r^+$ and $U_r^-$ are learned smaller matrices. Thus, letting $U_r = [U_r^+, U_r^-]$, the $r$-th dimension can also be rewritten as a Hadamard product-based projection:

$$(f)_r = \mathrm{Tr}(x_s y_t^T U_r^+ (U_r^+)^T) - \mathrm{Tr}(x_s y_t^T U_r^- (U_r^-)^T) = [\mathbf{1}^T, -\mathbf{1}^T](U_r^T x_s \circ U_r^T y_t) \quad (3)$$

Some algorithms (Kim et al., 2016; Yu et al., 2018; Kim et al., 2018) replace $\mathbf{1}$ in Eq. (2) by a learnable vector $p_r$ and transform Eq. (2) into the more general formulation

$$f = P^T (U^T x_s \circ V^T y_t) \quad (4)$$

where $P = [p_1, \cdots, p_h] \in \mathbb{R}^{l \times h}$, $U \in \mathbb{R}^{m \times l}$, and $V \in \mathbb{R}^{n \times l}$ are learnable matrices. Mathematically, Eq. (4) is a general formulation of Eq. (2) and Eq. (3), so we can focus our analysis on Eq. (4) alone.
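The equivalence between the trace form and the Hadamard form in Eq. (2), and the way Eq. (4) contains Eq. (2) as a special case, can be illustrated numerically. This is a small sketch assuming NumPy, random matrices, and illustrative sizes; the block-indicator choice of `P` below is one assumed way to recover Eq. (2) from Eq. (4), not necessarily the construction used by any cited method.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, h = 5, 4, 3, 2                 # illustrative sizes (assumptions)
x_s = rng.standard_normal(m)
y_t = rng.standard_normal(n)

# One rank-k projecting direction W_r = U_r V_r^T per output dimension r.
U_blocks = [rng.standard_normal((m, k)) for _ in range(h)]
V_blocks = [rng.standard_normal((n, k)) for _ in range(h)]

# Trace form of Eq. (2): (f)_r = Tr(x_s y_t^T V_r U_r^T).
f_trace = np.array([
    np.trace(np.outer(x_s, y_t) @ V_r @ U_r.T)
    for U_r, V_r in zip(U_blocks, V_blocks)
])

# Hadamard form of Eq. (2): (f)_r = 1^T (U_r^T x_s o V_r^T y_t).
f_hadamard = np.array([
    np.sum((U_r.T @ x_s) * (V_r.T @ y_t))
    for U_r, V_r in zip(U_blocks, V_blocks)
])

# Eq. (4): stack all columns (l = h * k) and let P sum within each block.
U = np.hstack(U_blocks)                 # m x l
V = np.hstack(V_blocks)                 # n x l
P = np.kron(np.eye(h), np.ones((k, 1)))  # l x h block indicator of ones
f_general = P.T @ ((U.T @ x_s) * (V.T @ y_t))
```

Replacing the fixed ones in `P` with learnable values gives the more general variant described for Eq. (4).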

2.2. SHORTCOMINGS OF TRADITIONAL FACTORIZED LOW-RANK BILINEAR POOLING

The linear dimension reduction finds the fewest projecting directions that map samples into a low-dimensional space while preserving their discriminant information. Thus, the projecting directions should satisfy the following criteria: (1) they preserve the discriminant information of samples well, otherwise the classification performance will be poor; (2) they are linearly independent, otherwise the calculated low-dimensional embeddings can be reduced further. Next, we demonstrate that the bilinear projection in Eq. (4) cannot find suitable projecting directions and harms the compactness and effectiveness of the compact bilinear features. Before doing so, we first introduce the following theorem.

Theorem 1. Suppose $\mathcal{U} = \{u_p \in \mathbb{R}^{m \times 1}\}_{p=1}^{l_1}$ and $\mathcal{V} = \{v_q \in \mathbb{R}^{n \times 1}\}_{q=1}^{l_2}$ are two groups of linearly independent vectors in the spaces $\mathbb{R}^{m \times 1}$ and $\mathbb{R}^{n \times 1}$, respectively. If the vector set $\mathcal{W} = \{\mathrm{vec}(W_i) \in \mathbb{R}^{mn \times 1}\}_{i=1}^{l_1 l_2}$ is constructed by $W_i = u_p v_q^T$ where $i = l_1 (q - 1) + p$, then $\mathcal{W}$ is a set of linearly independent vectors.

The proof is given in the Appendix. If $l_1 = m$ and $l_2 = n$, $\mathcal{W}$ is a complete base set of $\mathbb{R}^{mn \times 1}$. The $r$-th element of the low-dimensional feature $f$ presented in Eq. (4) can be rewritten as

$$(f)_r = \sum_{j=1}^{l} (p_r)_j \, \mathrm{vec}(u_j v_j^T)^T \mathrm{vec}(x_s y_t^T) \quad (5)$$

where $(p_r)_j$ is the $j$-th element in the $r$-th column of $P$. As seen from Eq. (5), $(f)_r$ is the coefficient of the bilinear feature $\mathrm{vec}(x_s y_t^T)$ projected on the projecting direction $\sum_{j=1}^{l} (p_r)_j \mathrm{vec}(u_j v_j^T)$, which is a vector in the linear space spanned by the vectors $\mathcal{B} = \{\mathrm{vec}(u_1 v_1^T), \mathrm{vec}(u_2 v_2^T), \cdots, \mathrm{vec}(u_l v_l^T)\}$. We now prove that the linear combinations of vectors in $\mathcal{B}$ cannot express all the possible projecting directions. Let us start from the case $l \leq \max\{m, n\}$.
Because learning a low-rank matrix is a challenging problem in the machine learning community (Candès et al., 2011; Liu et al., 2012; Wright et al., 2009), $U$ and $V$ in Eq. (4) are very likely to be full-rank matrices, since no additional constraints are imposed on them. Thus, the columns of $U$ (and $V$) are linearly independent. Figure 3 gives experimental evidence to support this assumption: because the columns of $U$ trained by the FBiP model of Eq. (4) are nearly orthogonal to each other, $U$ is a full-rank matrix. According to Theorem 1, the columns of $U$ and $V$ can generate a set of linearly independent vectors $\mathcal{W}$. Obviously, $\mathcal{B}$ is a small subset of $\mathcal{W}$. According to criterion (2), not only vectors in the space spanned by $\mathcal{B}$ but also vectors spanned by $\mathcal{W} - \mathcal{B}$ can generate possible projecting directions for reducing the dimension of samples in $\mathbb{R}^{mn \times 1}$.

Worse, this serious issue cannot be alleviated by increasing $l$. Without loss of generality, let $l > m = n$ and let the first $m$ columns of $U$ and $V$ be linearly independent. Thus, the last $l - m$ columns of $U$ (or $V$) can be represented by the first $m$ columns. To be specific, the $r$-th ($r > m$) columns of $U$ and $V$ are $u_r = \sum_{p=1}^{m} a_{rp} u_p$ and $v_r = \sum_{q=1}^{n} b_{rq} v_q$, so that $\mathrm{vec}(u_r v_r^T) = \sum_{p=1}^{m} \sum_{q=1}^{n} a_{rp} b_{rq} \mathrm{vec}(u_p v_q^T)$. Collect these directions in the matrix $L = [\mathrm{vec}(u_{m+1} v_{m+1}^T), \cdots, \mathrm{vec}(u_l v_l^T)]$. According to Theorem 1, $\{\mathrm{vec}(u_p v_q^T)\}_{p=1,q=1}^{m,n}$ is a complete base set of $\mathbb{R}^{mn \times 1}$. So $\mathrm{vec}(u_r v_r^T)$ can be any $mn$-dimensional vector by giving the auxiliary parameters $\{a_{rp} b_{rq}\}_{p=1,q=1}^{m,n}$ suitable values. However, because those auxiliary parameters are implicit, we cannot train them like the other parameters of our model. Thus, the learnable projecting directions can be any possible ones in the solution space. The worst case is that $L$ is a rank-1 matrix, which means most of its columns do not preserve the discriminant information of samples. Thus, we can transform $L$ into one column without performance reduction. This implies that the missed projecting directions cannot be found by increasing the value of $l$. Thus, FBiP cannot find suitable projecting directions.
Of course, the auxiliary parameters can let $L$ be a full-rank matrix. In this case, the projecting directions will be in the space spanned by the missed directions $\mathcal{W} - \mathcal{B}$. The performance of deep learning crucially depends on tricks in stochastic gradient descent strategies, such as 'momentum' and 'decay'. Because those auxiliary parameters cannot be improved by those tricks, their values may not be as good as desired, which probably makes the learned projecting directions unsuitable. Thus, the performance of FBiP will be poor. To alleviate this issue, FBiP has to adopt more projecting directions to capture enough discriminant information. Nevertheless, because large intra-class variances are kept, the performance of those unsuitable solutions has an upper bound lower than that of the suitable projecting directions. Consequently, the effectiveness and compactness of the output bilinear features are harmed.

For the case $U = V$, the following Corollary demonstrates that Eq. (4) also misses many feasible projecting matrices. The analysis can be made in the same way as above by replacing Theorem 1 with the Corollary, so we do not present it in this paper.

Corollary. Given a set of linearly independent vectors $\{u_i \in \mathbb{R}^{m \times 1}\}_{i=1}^{l}$, the symmetric matrices $\mathcal{S} = \left\{ \frac{u_i u_j^T + u_j u_i^T}{2} \in \mathbb{R}^{m \times m} \right\}_{i=1,j=1}^{l,l}$ are also linearly independent.

The proof is given in the Appendix.

Theorem 1 demonstrates that the rank-1 matrix base set can be decomposed into two vector base sets. According to Theorem 1, we can derive the traditional bilinear projection $Y = U^T X V$, which is well known in the fields of linear algebra (Strang et al., 1993) and machine learning (Pirsiavash et al., 2009; Nie et al., 2018). The conclusion is presented in the following theorem.
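The argument of this section, that the Hadamard-based directions $\mathcal{B}$ form only a small subset of the linearly independent set $\mathcal{W}$ from Theorem 1, can be illustrated on a toy example. The sketch below assumes NumPy, random full-rank $U$ and $V$, and small illustrative sizes; it is a numerical illustration rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, l = 4, 3, 3                       # illustrative sizes with l <= min(m, n)
U = rng.standard_normal((m, l))         # columns u_1, ..., u_l
V = rng.standard_normal((n, l))         # columns v_1, ..., v_l


def vec(A):
    """Column-stacking vectorization."""
    return A.reshape(-1, order="F")


# W: all l * l rank-1 directions vec(u_p v_q^T), ordered with p running fastest.
W = np.stack(
    [vec(np.outer(U[:, p], V[:, q])) for q in range(l) for p in range(l)], axis=1
)
# B: only the "diagonal" directions vec(u_j v_j^T) reachable through Eq. (5).
B = np.stack([vec(np.outer(U[:, j], V[:, j])) for j in range(l)], axis=1)

rank_W = np.linalg.matrix_rank(W)       # Theorem 1 predicts l * l
rank_B = np.linalg.matrix_rank(B)       # only l directions

# With column-stacking vec, vec(u v^T) = v (x) u, so W equals V (x) U.
W_kron = np.kron(V, U)

# An off-diagonal direction u_1 v_2^T cannot be written as a combination of B.
off_diag = vec(np.outer(U[:, 0], V[:, 1]))
coeffs, *_ = np.linalg.lstsq(B, off_diag, rcond=None)
residual = np.linalg.norm(B @ coeffs - off_diag)
```

Here `rank_W` is 9 while `rank_B` is 3, and the least-squares residual for the off-diagonal direction is clearly non-zero, so no choice of $P$ in Eq. (4) reaches it for these fixed $U$ and $V$.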

3. GENERAL BILINEAR PROJECTION

Theorem 2. If the coefficient of $X \in \mathbb{R}^{m \times n}$ on the base $u_p v_q^T$ is calculated as $y_{pq} = \mathrm{Tr}(X^T u_p v_q^T)$, then the coefficients of $X$ on the whole base set $\mathcal{W} = \{u_p v_q^T \mid u_p \in \mathcal{U}, v_q \in \mathcal{V}\}_{p=1,q=1}^{m,n}$ are $\{y_{pq}\}_{p=1,q=1}^{m,n}$, which form a coefficient matrix $Y \in \mathbb{R}^{m \times n}$ satisfying

$$Y = U^T X V$$

where $U = [u_1, \cdots, u_m] \in \mathbb{R}^{m \times m}$ and $V = [v_1, \cdots, v_n] \in \mathbb{R}^{n \times n}$.

The proof is presented in the Appendix.

Remark 1. Theorem 2 gives a way to decompose a matrix base set into two low-dimensional vector base sets. A base set corresponds to a linear projection in the linear space. Thus, the problem of finding an appropriate projection in a matrix space (a high-dimensional space) can be transformed into finding two vector projections in low-dimensional vector spaces, which alleviates the overfitting problem and saves computational resources, including running time and memory storage. Theorem 2 is a mathematical interpretation of why so many matrix-based algorithms modeled by the projection $Y = U^T X V$ (Pirsiavash et al., 2009; Nie et al., 2018; Fukui et al., 2016) work well. However, according to Theorem 2, the feature reduction algorithms based on Eq. (14), i.e., $Y = U^T X V$, only find rank-1 matrix projections. Rank-1 projections are a small portion of the feasible projections. If the application prefers high-rank projections, Eq. (14) will miss them and lead to a performance deficit. In the next section, we explore a more general bilinear projection based on matrix bases with high rank, i.e., $k > 1$.
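Theorem 2 states that collecting the coefficients $y_{pq} = \mathrm{Tr}(X^T u_p v_q^T)$ over all rank-1 bases gives exactly $Y = U^T X V$. A short numerical check follows, assuming NumPy and random matrices; the sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                             # illustrative sizes (assumptions)
X = rng.standard_normal((m, n))
U = rng.standard_normal((m, m))         # columns u_1, ..., u_m
V = rng.standard_normal((n, n))         # columns v_1, ..., v_n

# Coefficient of X on each rank-1 base u_p v_q^T, computed one entry at a time.
Y_entrywise = np.empty((m, n))
for p in range(m):
    for q in range(n):
        W_pq = np.outer(U[:, p], V[:, q])
        Y_entrywise[p, q] = np.trace(X.T @ W_pq)

# The same coefficients from the two-sided (bilinear) projection.
Y_bilinear = U.T @ X @ V
```

The entry-wise loop touches $mn$ explicit $m \times n$ bases, while the bilinear form only stores the two small factor matrices, which is the parameter saving discussed in Remark 1.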

3.2. FORMULATION OF THE GENERAL BILINEAR PROJECTION

Consider the coefficient of an arbitrary matrix $X \in \mathbb{R}^{m \times n}$ projected on a rank-k matrix base $W_p \in \mathcal{W}_k^{m \times n}$. The coefficient can be calculated as $y_p = \mathrm{Tr}(X W_p^T) = \mathrm{Tr}(U_p^T X V_p)$ by decomposing $W_p$ as $W_p = U_p V_p^T$, where $U_p \in \mathbb{R}^{m \times k}$ and $V_p \in \mathbb{R}^{n \times k}$. Since $\mathrm{Tr}(U_p^T X V_p) = \mathrm{vec}^T(X V_p)\,\mathrm{vec}(U_p) = \mathrm{vec}^T(V_p)(I_k \otimes X)^T \mathrm{vec}(U_p)$, the calculation of $y_p$ is equivalent to

$$y_p = \mathrm{Tr}\left((I_k \otimes X)\,\mathrm{vec}(V_p)\,\mathrm{vec}^T(U_p)\right) \quad (7)$$

where $\otimes$ represents the Kronecker product and $I_k$ is the $k \times k$ identity matrix. Eq. (7) indicates that the coefficient $y_p$ equals the coefficient of $I_k \otimes X \in \mathbb{R}^{mk \times nk}$ projected on the matrix $\mathrm{vec}(V_p)\mathrm{vec}^T(U_p) \in \mathbb{R}^{nk \times mk}$. As analyzed in the previous section, the term $\mathrm{vec}(V_p)\mathrm{vec}^T(U_p)$ consists of a set of free implicit variables that make the appropriate projecting directions hard to learn. To overcome this shortcoming, we employ the property presented in Theorem 1 to separate the implicit variables from the linearly independent projecting directions. Since $mn \ll mnk^2$, the matrices $\{\mathrm{vec}(V_p)\mathrm{vec}^T(U_p)\}_{p=1}^{mn}$ are likely located in a subspace of $\mathbb{R}^{nk \times mk}$ whose matrix base set can be decomposed into two vector base sets $\{\hat{u}_i\}_{i=1}^{l_1}$ and $\{\hat{v}_j\}_{j=1}^{l_2}$. Thus, for each $\mathrm{vec}(V_p)\mathrm{vec}^T(U_p)$, we can find a matrix $L_p \in \mathbb{R}^{l_1 \times l_2}$ such that

$$\mathrm{vec}(V_p)\,\mathrm{vec}^T(U_p) = \sum_{i=1}^{l_1} \sum_{j=1}^{l_2} (L_p)_{ij}\, \hat{v}_j \hat{u}_i^T \quad (8)$$

where $(L_p)_{ij}$ is the $ij$-th element of the matrix $L_p$, $l_1 \leq mk$ and $l_2 \leq nk$. By substituting Eq. (8) into Eq. (7), we obtain a new equation to calculate $y_p$ by employing $L_p$, $\{\hat{u}_i\}_{i=1}^{l_1}$ and $\{\hat{v}_j\}_{j=1}^{l_2}$. Collecting $\{y_p\}_{p=1}^{h}$ in a vector $y = [y_1, \cdots, y_h]$, the general bilinear projection is presented as follows:

$$y = P^T \mathrm{vec}\left(U^T (I_k \otimes X) V\right) \quad (9)$$

where $k$ indicates the rank of the matrix base set, $X \in \mathbb{R}^{m \times n}$, $U = [\hat{u}_1, \cdots, \hat{u}_{l_1}] \in \mathbb{R}^{mk \times l_1}$, $V = [\hat{v}_1, \cdots, \hat{v}_{l_2}] \in \mathbb{R}^{nk \times l_2}$, and $P = [\mathrm{vec}(L_1), \cdots, \mathrm{vec}(L_h)] \in \mathbb{R}^{(l_1 l_2) \times h}$.
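The chain of identities leading to Eq. (7) and the general projection of Eq. (9) can be verified numerically. The sketch below assumes NumPy, column-stacking $\mathrm{vec}$, and random matrices with illustrative sizes; `U_hat`, `V_hat`, and `L_list` are hypothetical stand-ins for the base vectors and the matrices $L_p$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 3, 2                       # illustrative sizes (assumptions)
X = rng.standard_normal((m, n))
U_p = rng.standard_normal((m, k))
V_p = rng.standard_normal((n, k))


def vec(A):
    """Column-stacking vectorization."""
    return A.reshape(-1, order="F")


IkX = np.kron(np.eye(k), X)             # (m k) x (n k), block diagonal with X

# Three ways to compute the coefficient y_p on W_p = U_p V_p^T.
y_direct = np.trace(X @ (U_p @ V_p.T).T)                # Tr(X W_p^T)
y_factored = np.trace(U_p.T @ X @ V_p)                  # Tr(U_p^T X V_p)
y_kron = np.trace(IkX @ np.outer(vec(V_p), vec(U_p)))   # Eq. (7)

# Eq. (9): explicit base vectors plus coefficient matrices L_1, ..., L_h.
l1, l2, h = 3, 2, 4
U_hat = rng.standard_normal((m * k, l1))
V_hat = rng.standard_normal((n * k, l2))
L_list = [rng.standard_normal((l1, l2)) for _ in range(h)]
P = np.stack([vec(L_p) for L_p in L_list], axis=1)      # (l1 * l2) x h

y_general = P.T @ vec(U_hat.T @ IkX @ V_hat)
# Same values from projecting I_k (x) X on sum_ij (L_p)_ij v_j u_i^T.
y_from_bases = np.array(
    [np.trace(IkX @ (V_hat @ L_p.T @ U_hat.T)) for L_p in L_list]
)

# Block-wise evaluation that never forms the sparse matrix I_k (x) X.
block_sum = sum(
    U_hat[i * m:(i + 1) * m].T @ X @ V_hat[i * n:(i + 1) * n] for i in range(k)
)
```

The last quantity shows that the product $U^T (I_k \otimes X) V$ can be accumulated from $k$ small products, which avoids storing the zero blocks of $I_k \otimes X$.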
Note that the sizes of $U$ and $V$ in Eq. (9) become $mk \times l_1$ and $nk \times l_2$ respectively, which differ from those used earlier in this paper.

Remark 2. Note that $P$ stores the free implicit variables mentioned in the previous section. In our proposed projection, they become explicit variables whose values are easy to constrain. Unlike Eq. (2), our bilinear projection considers all feasible projecting directions to reduce the dimension and the intra-class variance. Thus, our projection facilitates yielding more compact features than other FBiP approaches.

Remark 3. The BiP methods MPN (Li et al., 2017a), MPN-COV (Wang et al., 2020), iSQRT-COV (Li et al., 2018), SMSO (Yu & Salzmann, 2018) and DBTNet-50 (Simonyan & Zisserman, 2014) reduce the dimension of the local feature $x_s$ by a linear projection $\hat{x}_s = U^T x_s$ and then calculate low-dimensional bilinear features as $U^T x_s x_s^T U$. According to our analysis, this is equivalent to reducing the dimension of $x_s x_s^T$ by rank-1 matrix projecting directions, which misses much information. From Eq. (9), the projection on local features should be $\hat{x}_s = U^T (I_k \otimes x_s)$.

The term $I_k \otimes X$ in Eq. (9) has many zeros, which cost a lot of memory storage. By introducing the matrix partitions $U = [U_1^T, \cdots, U_k^T]^T$ and $V = [V_1^T, \cdots, V_k^T]^T$, we can reformulate our general bilinear projection as

$$y = P^T \mathrm{vec}\left(\sum_{i=1}^{k} U_i^T X V_i\right)$$

where $U_i \in \mathbb{R}^{m \times l_1}$, $V_i \in \mathbb{R}^{n \times l_2}$ and $P \in \mathbb{R}^{l_1 l_2 \times h}$.

4. RANK-k FACTORIZED BILINEAR POOLING

4.1. FORMULATION

Bilinear pooling. Following the traditional FBiP method (Kim et al., 2016), we employ Eq. (9) to reduce the dimension of the bilinear feature $x_s y_t^T$, and formulate a new compact bilinear pooling method named rank-k factorized bilinear pooling (RK-FBP) as follows:

$$f_i = P^T \mathrm{vec}\left(\sum_{r=1}^{k} (U_r^T x_s + b_r^u)(V_r^T y_t + b_r^v)^T\right) + b^p \quad (10)$$

where $b_r^u \in \mathbb{R}^{l_1 \times 1}$, $b_r^v \in \mathbb{R}^{l_2 \times 1}$ and $b^p \in \mathbb{R}^{h \times 1}$ are the bias terms of the projections $U_r \in \mathbb{R}^{m \times l_1}$, $V_r \in \mathbb{R}^{n \times l_2}$ and $P \in \mathbb{R}^{(l_1 l_2) \times h}$, respectively.

Remark 4. In Eq. (10), let us set $l_1 = l_2 = h$. If we fix $P^T \in \{0, 1\}^{h \times h^2}$ so that only the $(i, (i-1)h + i)$-th elements equal 1, then our model is equivalent to the bilinear pooling model used in (Gao et al., 2020; Amin et al., 2020). If we instead set the $(i, (i-1)h + i)$-th elements of $P^T$ as learnable parameters while the other elements are set to 0, our model is equivalent to the bilinear pooling module used in (Kim et al., 2016; Lu et al., 2016; Yu et al., 2017).

Multi-linear pooling. We employ the proposed bilinear projection to generate high-order pooling features. However, the dimension of high-order pooling features is huge, so it is intractable to perform the dimension reduction on the high-order features directly. Thus, we give a recursive way to output compact high-order pooling features by utilizing several RK-FBP modules. Without ambiguity, we denote $f_i^t$ as the $t$-th order pooling feature; then the $(t+1)$-th order pooling feature can be calculated as follows:

$$f_i^{t+1} = P^T \mathrm{vec}\left(\sum_{r=1}^{k} (U_r^T x_s + b_r^u)(V_r^T f_i^t + b_r^v)^T\right) + b^p \quad (11)$$

For the obtained features $\{f_i^t\}_{t=1}^{O}$, we concatenate them into one feature vector, i.e., $f_i = [(f_i^1)^T, \cdots, (f_i^O)^T]^T$, before feeding them into the classifier.

[Figure 4: Nested RK-FBP modules integrate high-order information into the multi-linear representations. From left to right, the first, second, and last modules integrate the second-order, third-order, and fourth-order information, respectively. Each RK-FBP module has its own parameters. Finally, several types of those features are chosen and concatenated into a more informative one.]
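The RK-FBP module of Eq. (10), its nesting in Eq. (11), and the special case of Remark 4 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: a single local feature pair, random parameters, small illustrative sizes, and hypothetical helper names (`rk_fbp`, `random_module`); it is not the authors' PyTorch implementation.

```python
import numpy as np


def vec(A):
    """Column-stacking vectorization."""
    return A.reshape(-1, order="F")


def rk_fbp(x, y, Us, Vs, bus, bvs, P, bp):
    """Eq. (10): P^T vec(sum_r (U_r^T x + b_r^u)(V_r^T y + b_r^v)^T) + b^p."""
    M = np.zeros((Us[0].shape[1], Vs[0].shape[1]))
    for U_r, V_r, bu_r, bv_r in zip(Us, Vs, bus, bvs):
        M += np.outer(U_r.T @ x + bu_r, V_r.T @ y + bv_r)
    return P.T @ vec(M) + bp


def random_module(rng, m, n, l1, l2, h, k):
    """Random parameters for one module (illustrative initialization only)."""
    return dict(
        Us=[rng.standard_normal((m, l1)) for _ in range(k)],
        Vs=[rng.standard_normal((n, l2)) for _ in range(k)],
        bus=[rng.standard_normal(l1) for _ in range(k)],
        bvs=[rng.standard_normal(l2) for _ in range(k)],
        P=rng.standard_normal((l1 * l2, h)),
        bp=rng.standard_normal(h),
    )


rng = np.random.default_rng(0)
m, l1, l2, h, k = 6, 4, 4, 3, 2         # illustrative sizes (assumptions)
x = rng.standard_normal(m)

# Nesting as in Eq. (11): the second module pairs x with the previous output.
mod_a = random_module(rng, m, m, l1, l2, h, k)
f_a = rk_fbp(x, x, **mod_a)
mod_b = random_module(rng, m, h, l1, l2, h, k)   # second input is h-dimensional
f_b = rk_fbp(x, f_a, **mod_b)
f_concat = np.concatenate([f_a, f_b])

# Remark 4 special case: l1 = l2 = h, zero biases, P^T selects the diagonal.
sel = random_module(rng, m, m, h, h, h, k)
sel["bus"] = [np.zeros(h) for _ in range(k)]
sel["bvs"] = [np.zeros(h) for _ in range(k)]
sel["bp"] = np.zeros(h)
P_sel = np.zeros((h * h, h))
for i in range(h):
    P_sel[i * h + i, i] = 1.0           # position of M[i, i] in vec(M)
sel["P"] = P_sel
f_sel = rk_fbp(x, x, **sel)
f_had = sum((U_r.T @ x) * (V_r.T @ x) for U_r, V_r in zip(sel["Us"], sel["Vs"]))
```

With the selector `P_sel`, the module output coincides with the summed Hadamard products, which is the sense in which the Hadamard-based models are special cases; a dense learnable `P` additionally mixes the off-diagonal entries of the $l_1 \times l_2$ matrix.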
Considering that the information of different orders may conflict with each other, some orders of pooling features can be omitted from the concatenation. The structure of our model is depicted in Figure 4.

Memory Analysis. Compared with the full bilinear pooling algorithm (Lin, 2015), our method yields a great saving in memory storage and computational load. For a $c$-class classification task, $U$, $V$, $P$ and the classifier hyperplane have $k m l_1 + k n l_2 + l_1 l_2 h + c h$ elements. In image classification tasks adopting ResNet50 as the backbone, we have $m = 2048$ and $n = 2048$, and BiP requires $mnc = 200mn \approx 10^9$ parameters to output the classification result. In our RK-FBP, we set $l_1 = l_2 = 300$, $h = 512$ and $k = 4$ at most, which gives on the order of $10^7$ parameters (about $5 \times 10^7$ by the count above), more than an order of magnitude fewer than BiP and comparable to traditional factorized bilinear pooling algorithms, e.g., FBC (Gao et al., 2020; Fang et al., 2019). Because it has fewer parameters, the computational load is also reduced. Because the high-order RK-FBP is recursively constructed by nesting several RK-FBP modules together, its memory storage is several times that of a single RK-FBP. Considering that a $T$-fold multi-linear feature has $m^T$ dimensions for an $m$-dimensional first-order feature, the weights of our multi-linear version of RK-FBP are much smaller.
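The parameter counts in the memory analysis can be reproduced with a few lines of arithmetic. The sketch below uses the formula $k m l_1 + k n l_2 + l_1 l_2 h + c h$ and the settings quoted above ($m = n = 2048$, $l_1 = l_2 = 300$, $h = 512$, $k = 4$, $c = 200$); the function names are hypothetical.

```python
def rk_fbp_param_count(m, n, l1, l2, h, k, c):
    """Elements of U, V, P and the classifier: k*m*l1 + k*n*l2 + l1*l2*h + c*h."""
    return k * m * l1 + k * n * l2 + l1 * l2 * h + c * h


def full_bip_param_count(m, n, c):
    """Classifier weights on the full m*n-dimensional bilinear feature."""
    return m * n * c


rk_params = rk_fbp_param_count(m=2048, n=2048, l1=300, l2=300, h=512, k=4, c=200)
bip_params = full_bip_param_count(m=2048, n=2048, c=200)
p_share = (300 * 300 * 512) / rk_params   # fraction of parameters held by P

print(rk_params)    # 51097600
print(bip_params)   # 838860800
```

Under these settings roughly 90% of the RK-FBP parameters sit in $P$, since the $l_1 l_2 h$ term dominates the count.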

5.1. EXPERIMENTAL SETTING

We conduct experiments on image classification tasks to evaluate the proposed RK-FBP model. The adopted image datasets are the Describing Texture Dataset (DTD) (Cimpoi et al., 2014), MINC-2500 (MINC) (Bell et al., 2015), MIT-Indoor (INDOOR) (Quattoni & Torralba, 2009), Caltech-UCSD Bird (CUB200) (Xie et al., 2013), Cars196 (Krause et al., 2013), and Aircraft (Maji et al., 2013), which are a texture image dataset in the wild, a material dataset, an indoor scene dataset, and three fine-grained image datasets, respectively. We incorporate our RK-FBP module into deep neural networks as shown in Figure 4 and implement it in PyTorch (Paszke et al., 2017). The ImageNet-pretrained weights are taken from torchvision (Marcel & Rodriguez, 2010). The adopted backbones are VGG-16 and ResNet50. Each image is resized to 418 × 418. VGG-16 generates a 28 × 28 × 512 feature map from the conv5_3 layer, so the dimension of the local features is d = 512 and the number of local features is 784. ResNet50 generates a 14 × 14 × 2048 feature map, which results in 196 local features of dimension 2048. Following Gao et al. (2016), the compact bilinear feature f is sent to the softmax classifier. Each RK-FBP module has its own parameters U, V, and P, which are updated by the back-propagation algorithm. PyTorch is run on a Quadro RTX 6000 GPU. The optimization uses SGD with weight decay 0.01, learning rate 0.01, and momentum 0.9. The number of epochs is 60 and the batch size is 32.

6. We find that when l = 250 and k = 2, RK-FBP achieves the best performance, 83.5%. When l = 250 and k = 1, the accuracy is about 81.9%, with about 2.6% decrease. When l ≥ 150, the accuracy is relatively stable for k > 2, and the variation in accuracy over k for each l is within 2%, which means k = 2 is enough. But when l < 150, the maximal accuracy corresponds to a large rank, k = 9. This means that when the number of matrix bases is small, a large rank is preferred.
Because large-rank matrices have more parameters, they can depict the projecting directions more accurately. Thus, this observation may indicate that fewer dimensions need more accurate projecting directions to capture the discriminant information. Besides, we employ $\hat{x}_s = U^T (I \otimes x_s)$ to improve the dimension reduction procedure in SMSO (Yu & Salzmann, 2018), MPN (Li et al., 2017a), and iSQRT-COV (Li et al., 2018). As seen from Table 1, the accuracy of those methods improves as the rank k increases, and the maximum gain is 6.6%, achieved by MPN on the Indoor dataset. Moreover, as k increases, the accuracy first increases and then becomes stable. This is consistent with our RK-FBP when the number of matrix bases is small. Both experiments illustrate the importance of the rank k, and indicate that the current dimension reduction methods in those bilinear pooling models limit their expressive ability.

Dimensionality h. We compare our model with FBC, RK-FBP, and RK-HFBP (presented in Eq. (38)) on Indoor. We report the testing accuracy at h = [128, 256, 512, 1024, 2048, 3072, 4096, 8192]. We set the rank k = 2 and l_1 = l_2 = 250 for our RK-FBP. For RK-HFBP and FBC, the ranks are empirically set as k = 8 and k = 14, respectively. As seen from Figure 6, RK-FBP outperforms FBC and RK-HFBP at every dimensionality. Specifically, RK-FBP is higher than FBC by about 4.1% when h = 512. This validates that our model can find more discriminative projecting directions than RK-HFBP and FBC. When the dimension increases beyond h = 512, the performance of RK-FBP is relatively stable. Considering that RK-FBP achieves the best result among the three approaches, this indicates that the performance of RK-FBP saturates. For FBC and RK-HFBP, in contrast, the performance continues to increase when h > 512. This may be because they cannot find accurate projecting directions, so more discriminant information is acquired when the feature dimension increases. However, they cannot surpass RK-FBP. This implies that the missing discriminant information cannot be completely recovered by increasing the dimension of the compact bilinear features.

5.3. COMPARISON WITH THE STATE-OF-THE-ART ALGORITHMS

VGG-16 backbone. We first compare with full bilinear pooling methods: BCNN, improved BCNN, DeepO2P, G2DeNet, RUN, and DeepKSPD. Except for BCNN, those methods adopt enhanced normalization strategies to overcome the burstiness problem. As seen from Table 2, only G2DeNet slightly surpasses our RK-FBP, by 0.6% on the CUB200 dataset. On the other five datasets, our RK-FBP achieves the best results, and the largest gain is 8.6% (over DeepO2P) on the MINC dataset. Then, we further compare with the medium-scale bilinear features, ReDro and iSQRT-COV. ReDro is a grouped bilinear pooling method which reduces the dimension of bilinear features to 33K. As shown in Table 2, our RK-FBP achieves results comparable with ReDro. Specifically, RK-FBP is higher than ReDro by 0.1% and 0.3% on CUB200 and Cars196, and by 3.3% and 0.9% on the Indoor and Aircraft datasets, respectively. iSQRT-COV reduces the dimension of the local features from 512 to 256 using a 1 × 1 convolution layer, so the dimension of its bilinear features is 33K too. Our RK-FBP outperforms iSQRT-COV on all datasets except CUB200. In particular, RK-FBP significantly surpasses iSQRT-COV by 4.0% on Indoor. Because ReDro and iSQRT-COV are designed to overcome the burstiness of bilinear features, their dimensions are relatively low. We draw two conclusions: (1) the positive results over the high-dimensional BiP methods indicate that high dimension limits the efficiency of bilinear features in the succeeding classification tasks; (2) considering that the above comparison methods employ matrix normalization strategies, the comparable results mean that our RK-FBP can effectively solve the burstiness problem caused by the intra-class variances, which indicates that our method can find suitable projecting directions. We also compare with several state-of-the-art compact bilinear pooling methods: CBP, LRBP, FBC, and TKPF. For those methods, we report the best results and their corresponding dimensions.
As shown in Table 2 , our RK-FBP achieves better classification accuracy using the lowest dimension. Compared to 4K, the smallest dimensions of those comparison methods, 512 and 1024 are extremely compact. It is mainly because those compact methods can not find suitable projecting directions since they miss a lot of possible directions. As for the number of parameters, our RK-FBP uses more parameters than other compact methods. However, compared with the importance of accurate projecting directions, the cost of such an amount of storage is acceptable. Most of our model's parameters are contributed by P, whose parameters can be reduced by constraining it using sparsity regularization. It will be done in our future work. At last, we report the performance of our multi-linear models (3 + + 2 + means capturing the 2order and the 3-th order information). As seen from Table 2 , our multi-linear model achieves the best accuracy while the dimension is 1024. Especially, our multi-linear features surpass HPB by 0.2%, where HBP is an enhanced bilinear pooling method fusing features across different layers by Hadamard product. Because the dimension of traditional compact bilinear pooling is high, it is hard to calculate multi-linear features by nesting them together. This excellent result shows the benefit of extremely low dimension features in the community. ResNet50 backbone. As shown in Table 2 , our models (bilinear and multi-linear) surpass other comparison methods on most datasets for the comparison methods, which indicates our model is robust to the backbones. Besides, we also compare with DBTNet-50 is a deep structure constructed by bilinear transformation blocks. We find that DBTNet-50 surpasses RK-FBP on CUB200 and Cars196 being weaker than RK-FBP on Aircraft. However, considering that DBTNet-50 applies the bilinear pooling in every layer, the results reported are not significant. 
Besides, our multi-linear features can outperform DBTNet-50 on most datasets with smaller dimensions. This may be because the transformation used in DBTNet-50 is based on rank-1 matrix bases, whose learning ability is limited. So DBTNet-50 may be improved by our proposed rank-k bilinear projection.

6. CONCLUSION

In this paper, we reveal that traditional factorized bilinear pooling tends to miss feasible projecting directions. To overcome this disadvantage, we propose a general bilinear projection to formulate a pooling module called rank-k factorized bilinear pooling (RK-FBP). Our RK-FBP has the following advantages: (1) RK-FBP is derived from a general bilinear projection based on complete matrix bases, so no feasible projecting directions will be missed. (2) Because the projecting directions are accurate, the learned bilinear features are not only compact but also discriminative. Those benefits give RK-FBP the power to produce more expressive compact bilinear features. The conducted experiments demonstrate that RK-FBP outperforms various state-of-the-art algorithms on challenging image classification tasks.

7. DEFINITIONS AND PROOFS

Definition 1: In the linear space $\mathbb{R}^{m\times n}$, the inner product between two matrices $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{m\times n}$ is defined as $\langle A, B\rangle = \mathrm{Tr}(AB^T)$.

Definition 2: $\{W_p\}_{p=1}^{mn}$ is a group of complete bases of $\mathbb{R}^{m\times n}$ if and only if, for any matrix $X \in \mathbb{R}^{m\times n}$, there is a unique non-zero vector $y = [y_1, y_2, \cdots, y_{mn}]$ such that the following equation holds:

$$X = \sum_{p=1}^{mn} y_p W_p \quad (12)$$

where $y_p$ is the projected coefficient of $X$ on the $p$-th base $W_p$. The vector $y$ can be seen as the representation of the matrix $X$ on the base set $\{W_p\}_{p=1}^{mn}$.

Definition 3: Suppose $\mathcal{W}_k^{m\times n} = \{W_p \in \mathbb{R}^{m\times n}\}_{p=1}^{mn}$ with $\max_p\{\mathrm{rank}(W_p)\} = k$. $\mathcal{W}_k^{m\times n}$ is a complete orthonormal base set of the matrix space $\mathbb{R}^{m\times n}$ if the following equation holds:

$$\mathrm{Tr}(W_p^T W_q) = \begin{cases} 1, & p = q \\ 0, & p \neq q \end{cases} \quad (13)$$

Theorem 1. Suppose $\mathcal{U} = \{u_p \in \mathbb{R}^{m\times 1}\}_{p=1}^{l_1}$ and $\mathcal{V} = \{v_q \in \mathbb{R}^{n\times 1}\}_{q=1}^{l_2}$ are two groups of linearly independent vectors in the spaces $\mathbb{R}^{m\times 1}$ and $\mathbb{R}^{n\times 1}$, respectively. If the vector set $\mathcal{W} = \{\mathrm{vec}(W_i) \in \mathbb{R}^{mn\times 1}\}_{i=1}^{l_1 l_2}$ is constructed by $W_i = u_p v_q^T$ where $i = l_2(p-1) + q$, then $\mathcal{W}$ is a set of linearly independent vectors.

Proof. Suppose $U = [u_1, \cdots, u_{l_1}]$ and $V = [v_1, \cdots, v_{l_2}]$. Because $\mathrm{vec}(W_i) = \mathrm{vec}(u_p v_q^T) = u_p \otimes v_q$, there is $U \otimes V = [\mathrm{vec}(u_1 v_1^T), \mathrm{vec}(u_1 v_2^T), \cdots, \mathrm{vec}(u_1 v_{l_2}^T), \cdots, \mathrm{vec}(u_{l_1} v_1^T), \cdots, \mathrm{vec}(u_{l_1} v_{l_2}^T)]$, where $\otimes$ is the Kronecker product. Thus, the columns of $U \otimes V$ are the vectors in $\mathcal{W}$. According to the rank property of the Kronecker product, $\mathrm{rank}(U \otimes V) = \mathrm{rank}(U)\,\mathrm{rank}(V) = l_1 l_2$. Thus, $U \otimes V$ has full column rank, and the vectors in $\mathcal{W}$ are linearly independent. □

Theorem 2. If the coefficient of $X \in \mathbb{R}^{m\times n}$ on the base $u_p v_q^T$ is calculated as $y_{pq} = \mathrm{Tr}(X^T u_p v_q^T)$, then the coefficients of $X$ on the whole base set $\mathcal{W} = \{u_p v_q^T \mid u_p \in \mathcal{U}, v_q \in \mathcal{V}\}_{p=1,q=1}^{m,n}$ are $\{y_{pq}\}_{p=1,q=1}^{m,n}$, which form a coefficient matrix $Y \in \mathbb{R}^{m\times n}$ satisfying:

$$Y = U^T X V \quad (14)$$

where $U = [u_1, \cdots, u_m] \in \mathbb{R}^{m\times m}$ and $V = [v_1, \cdots, v_n] \in \mathbb{R}^{n\times n}$.

Proof. Because $y_{pq} = \mathrm{Tr}(X^T u_p v_q^T) = u_p^T X v_q$, arranging the $y_{pq}$ as a matrix gives $Y = U^T X V$. □

Corollary. Given a set of linearly independent vectors $\{u_i \in \mathbb{R}^{m\times 1}\}_{i=1}^{l}$, the symmetric matrices $\mathcal{S} = \{\frac{u_i u_j^T + u_j u_i^T}{2} \in \mathbb{R}^{m\times m}\}_{i=1,j=1}^{l,l}$ are also linearly independent.

Proof. According to Theorem 1, the matrices $\{u_i u_j^T\}_{i=1,j=1}^{l,l}$ form a set of linearly independent matrices. Thus, the only solution of the following equation is $c_{ij} = 0$:

$$\sum_{i=1}^{l}\sum_{j=1}^{l} c_{ij}\, u_i u_j^T = 0 \quad (15)$$

Let us consider the following equation with variables $C' = \{c'_{ij} \mid i = 1, \cdots, l;\ j = 1, \cdots, l;\ j \le i\}$:

$$\sum_{i=1}^{l} c'_{ii}\, u_i u_i^T + \sum_{i=1}^{l}\sum_{j<i} c'_{ij}\, \frac{u_i u_j^T + u_j u_i^T}{2} = 0 \quad (16)$$

If some $c'_{ij} \neq 0$, we let $c_{ij} = c_{ji} = c'_{ij}/2$ for $j < i$ and $c_{ii} = c'_{ii}$. Then Eq.(15) has a non-zero solution, which violates the conclusion that $c_{ij} = 0$ for Eq.(15). Thus, all elements in $C'$ are 0. It means the matrix set $\mathcal{S}$ is a group of linearly independent matrices.
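Theorems 1 and 2 can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the helper names `rank1_bases` and `coefficients`, the sizes, and the random test matrices are hypothetical and not part of the paper, and `vec` is taken as row-major flattening so that $\mathrm{vec}(u v^T) = u \otimes v$.

```python
import numpy as np

def rank1_bases(U, V):
    """Stack vec(u_p v_q^T) as columns, ordered by i = l2 * p + q (0-based).

    vec is row-major flattening (an assumption), so vec(u v^T) = kron(u, v).
    """
    cols = [np.outer(U[:, p], V[:, q]).ravel()
            for p in range(U.shape[1])
            for q in range(V.shape[1])]
    return np.stack(cols, axis=1)

def coefficients(X, U, V):
    """Entry-wise coefficients y_pq = Tr(X^T u_p v_q^T) from Theorem 2."""
    Y = np.empty((U.shape[1], V.shape[1]))
    for p in range(U.shape[1]):
        for q in range(V.shape[1]):
            Y[p, q] = np.trace(X.T @ np.outer(U[:, p], V[:, q]))
    return Y

rng = np.random.default_rng(0)
m, n = 4, 3
U = rng.standard_normal((m, m))   # columns u_p (independent with probability 1)
V = rng.standard_normal((n, n))   # columns v_q
X = rng.standard_normal((m, n))

W = rank1_bases(U, V)             # columns are the vectorized rank-1 bases
rank_W = np.linalg.matrix_rank(W)
Y_loop = coefficients(X, U, V)
Y_closed = U.T @ X @ V            # closed form Y = U^T X V

print(rank_W, np.allclose(W, np.kron(U, V)), np.allclose(Y_loop, Y_closed))
```

The stacked bases coincide with `np.kron(U, V)`, which is why the rank argument in the proof of Theorem 1 applies directly.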

8.1. WHY THE BILINEAR POOLING CAN ENHANCE THE DISCRIMINANT ABILITY OF LOCAL FEATURES

Given a local feature $x_i \in \mathbb{R}^{m\times 1}$, its corresponding bilinear feature is $z_i = \mathrm{vec}(x_i x_i^T)$. The inner product between the bilinear features of $x_i$ and $x_j$ is

$$\langle z_i, z_j\rangle = \mathrm{vec}(x_i x_i^T)^T \mathrm{vec}(x_j x_j^T) = (x_i^T x_j)^2 \quad (17)$$

Comparing Eq.(17) with the polynomial kernel function $k(x_i, x_j) = (a(x_i^T x_j) + d)^p$, the inner product between bilinear features equals the polynomial kernel function with $a = 1$, $d = 0$ and $p = 2$. Therefore, we can claim that the bilinear feature is the explicit result of the non-linear projection determined by the polynomial kernel function. How to make the features output by the backbone fit well to the hyper-parameters of the polynomial kernel function is not the concern of our paper. Thus, we suppose the backbone can generate features good enough to satisfy the hyper-parameters.

As for the local features $X_i = [x_{i1}, x_{i2}, \cdots, x_{ic}] \in \mathbb{R}^{m\times c}$, the bilinear pooling $Z_i = X_i X_i^T \in \mathbb{R}^{m\times m}$ is just the sum of the bilinear features of the columns of $X_i$. So its enhanced discriminant ability can also be interpreted by the polynomial kernel function. As is well known, the polynomial kernel function can improve the classification performance of support vector machines, which are linear discriminant classifiers. Thus, the discriminant information in bilinear features can be well depicted by a linear projection. This is the basis on which we can employ linear projections to reduce the dimensionality of bilinear features, which is a crucial step to solve the burstiness problem of bilinear features.
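The two facts used in this subsection, that the inner product of bilinear features equals the squared inner product of the local features and that bilinear pooling is the sum of per-column outer products, can be verified with a few lines of NumPy. This is a minimal sketch; the function names and the random test data are illustrative assumptions.

```python
import numpy as np

def bilinear_feature(x):
    """z = vec(x x^T) for a single local feature x."""
    return np.outer(x, x).ravel()

def bilinear_pooling(X):
    """Z = X X^T for local features stored as the columns of X."""
    return X @ X.T

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
y = rng.standard_normal(5)

# <vec(x x^T), vec(y y^T)> versus the degree-2 polynomial kernel (a=1, d=0).
lhs = bilinear_feature(x) @ bilinear_feature(y)
rhs = (x @ y) ** 2

# Pooling over c = 7 local features equals the sum of the outer products.
X = rng.standard_normal((5, 7))
Z = bilinear_pooling(X)
Z_sum = sum(np.outer(X[:, j], X[:, j]) for j in range(X.shape[1]))

print(np.isclose(lhs, rhs), np.allclose(Z, Z_sum))
```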

8.2. HOW THE BURSTINESS REDUCES THE PERFORMANCE OF MODELS

The burstiness phenomenon of image features was first analysed by Jégou et al. (2009), who focus on the feature vectors obtained by bag-of-words frameworks. In bag-of-words frameworks, each dimension of the feature vectors corresponds to a visual word collected from the whole image data. For the i-th image, the feature vector is $x_i = [x_{i1}, x_{i2}, \cdots, x_{im}]^T \in \mathbb{R}^{m\times 1}$, where the value of $x_{ij}$ is the frequency with which the j-th visual word appears in the i-th image. Suppose the label of the i-th image is $y_i$, so the average value of the j-th dimension over the c-th class of samples is $\bar{x}_j = \frac{1}{N_c}\sum_{y_i = c} x_{ij}$, where $N_c$ is the number of samples of the c-th class. In some cases, the j-th visual word may appear many times in the i-th image but not so many times in other images. This makes the value of $x_{ij}$ much larger than the average value of the j-th dimension, i.e., $\bar{x}_j$. This is the burstiness phenomenon of bag-of-words features. Geometrically, this burstiness phenomenon increases the variance of samples in each class in the feature space. Therefore, Wei et al. (2018) describe the burstiness phenomenon as "the problem that the feature descriptor is not invariant enough where the feature elements may have large variances within the same class." Thus, the burstiness problem can be summarized as the problem of large intra-class variance. The burstiness of bilinear features is caused by the outer product on the local features. Let us take three-dimensional data as an example. Suppose there are three samples $x_1 = [1, 1, 1]^T$, $x_2 = [3, 1, 1]^T$, $x_3 = [1.2, 1.2, 1.2]^T$.
The bilinear features of $x_1$, $x_2$, and $x_3$ are $z_1 = \mathrm{vec}(x_1 x_1^T) = [1, 1, 1, 1, 1, 1, 1, 1, 1]^T$, $z_2 = \mathrm{vec}(x_2 x_2^T) = [9, 3, 3, 3, 1, 1, 3, 1, 1]^T$, and $z_3 = \mathrm{vec}(x_3 x_3^T) = [1.44, 1.44, 1.44, 1.44, 1.44, 1.44, 1.44, 1.44, 1.44]^T$. We can calculate the average of $z_1$ and $z_3$, i.e., $\bar{z} = \frac{z_1 + z_3}{2} = [1.22, 1.22, 1.22, 1.22, 1.22, 1.22, 1.22, 1.22, 1.22]^T$. If most of the local features are close to $x_1$ and $x_3$, $\bar{z}$ can be considered as the average bilinear feature of the whole class. Thus, $z_2$ is far away from the average $\bar{z}$. Because this phenomenon is similar to the burstiness of bag-of-words features, it is also called the burstiness of bilinear features, and it also expands the intra-class variance.
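The three-sample example can be reproduced directly. The following sketch recomputes the bilinear features and compares the distance of the typical sample and of the bursty sample to the class average; the helper name `bilinear_feature` and the distance comparison are illustrative additions, not part of the original analysis.

```python
import numpy as np

def bilinear_feature(x):
    """z = vec(x x^T) for one local feature."""
    x = np.asarray(x, dtype=float)
    return np.outer(x, x).ravel()

x1 = [1.0, 1.0, 1.0]
x2 = [3.0, 1.0, 1.0]   # the "bursty" sample: one dimension is much larger
x3 = [1.2, 1.2, 1.2]

z1, z2, z3 = (bilinear_feature(x) for x in (x1, x2, x3))

# Class average estimated from the two typical samples x1 and x3.
z_bar = (z1 + z3) / 2.0

# Euclidean distances to the average in the bilinear feature space.
dist_typical = np.linalg.norm(z1 - z_bar)
dist_bursty = np.linalg.norm(z2 - z_bar)

print(z2)
print(z_bar)
print(dist_typical, dist_bursty)
```

The bursty sample ends up more than ten times farther from the average than the typical sample, which is the enlarged intra-class variance described in the text.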

Some images contain illumination variations and appearance changes (Gao et al., 2020; Wei et al., 2018), which make the features extracted by deep neural networks also have some variances within each class. The variance may be reflected by the singular values of the matrix storing the local features, i.e., $X = \sum_{i=1}^{T} u_i \sigma_i v_i^T$, where $\sigma_i$ is the i-th singular value and $u_i$ and $v_i$ are the corresponding singular vectors. Because the bilinear pooling on $X$ is $XX^T = \sum_{i=1}^{T} u_i \sigma_i^2 u_i^T$, those variances are expanded by the outer product and cause the burstiness of bilinear features in deep frameworks. Because the burstiness affects the similarity between bilinear features (Wei et al., 2018), it affects the performance of models based on similarity. For classification tasks, the bilinear features are likely to be linearly discriminant. Due to the high dimensionality of bilinear features, there are many feasible classifier solutions that fit the training data well. However, the large intra-class variance caused by the burstiness may let the classifier select a bad solution which generalizes poorly on the test dataset, and the performance of bilinear features is harmed. Thus, alleviating the burstiness problem is very important for learning bilinear features.

8.3. SIGNED ELEMENTWISE SQUARE-ROOT OPERATION

The signed elementwise square-root operation transforms a vector $x = [x_1, x_2, \cdots, x_m]^T$ to a new vector $\tilde{x} = [\tilde{x}_1, \tilde{x}_2, \cdots, \tilde{x}_m]^T$, in which $\tilde{x}_i$ is calculated as

$$\tilde{x}_i = \mathrm{sgn}(x_i)\sqrt{|x_i|} \quad (18)$$

Considering the value of $\sqrt{|x_i|}$, there is

$$\begin{cases} \sqrt{|x_i|} \ge |x_i|, & |x_i| \le 1 \\ \sqrt{|x_i|} < |x_i|, & |x_i| > 1 \end{cases}$$

In this way, we can find that $\sqrt{|x_i|}$ pushes the value $|x_i|$ toward 1. Taking the value $\mathrm{sgn}(x_i)$ into account, Eq.(18) moves the vector $x \in \mathbb{R}^{m\times 1}$ close to the centers $[\pm 1, \pm 1, \cdots, \pm 1] \in \mathbb{R}^{m\times 1}$, where the symbol $\pm$ is determined by $\mathrm{sgn}(x_i)$. Thus, if samples from different classes are well separated and are located in different quadrants of the feature space, the elementwise square-root operation can reduce the intra-class variance well, and the generalization is good. Such a requirement can be satisfied by bag-of-words features. This is because bag-of-words frameworks generate feature vectors according to the frequency with which each visual element appears in each image, where the visual elements are often generated by clustering algorithms and thus have explicit similarity meanings. Wei et al. (2018) reveal that the features extracted by deep neural networks do not meet the requirement of the signed elementwise square-root transformation.

8.4. L2-NORMALIZATION

L2-normalization is widely used in deep neural networks to enhance the generalization ability of learned features. For a vector $x_i \in \mathbb{R}^{m\times 1}$, the L2-normalization of $x_i$ is $\hat{x}_i$, defined as

$$\hat{x}_i = x_i / \|x_i\|_2$$

Therefore, the squared Euclidean distance between $\hat{x}_i$ and $\hat{x}_j$ is $\|\hat{x}_i - \hat{x}_j\|_2^2 = 2 - 2\hat{x}_i^T \hat{x}_j$. Thus, after L2-normalization, the Euclidean distance between samples can be replaced by the cosine distance. For the features extracted by deep neural networks, L2-normalization reduces the variances in each class (Meng et al., 2021), because the variance information lies along the radius direction in the feature space.
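Both normalization steps are easy to state in code. The sketch below is a minimal illustration, assuming plain NumPy vectors; the function names `signed_sqrt` and `l2_normalize` and the small `eps` guard are choices made for this example. It also checks that for unit vectors the squared Euclidean distance equals $2 - 2\hat{x}_i^T\hat{x}_j$.

```python
import numpy as np

def signed_sqrt(x):
    """Signed elementwise square root: sgn(x_i) * sqrt(|x_i|)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.sqrt(np.abs(x))

def l2_normalize(x, eps=1e-12):
    """Project x onto the unit hyper-sphere (eps avoids division by zero)."""
    x = np.asarray(x, dtype=float)
    return x / max(np.linalg.norm(x), eps)

x = np.array([-4.0, -0.25, 0.0, 0.25, 4.0])
x_tilde = signed_sqrt(x)

# Magnitudes at or below 1 do not shrink, magnitudes above 1 shrink.
small = np.abs(x) <= 1
grows_or_equal = np.all(np.abs(x_tilde[small]) >= np.abs(x[small]))
shrinks = np.all(np.abs(x_tilde[~small]) < np.abs(x[~small]))

# For unit vectors, squared distance and cosine similarity are tied together.
rng = np.random.default_rng(0)
a = l2_normalize(rng.standard_normal(8))
b = l2_normalize(rng.standard_normal(8))
sq_dist = np.sum((a - b) ** 2)
from_cosine = 2.0 - 2.0 * (a @ b)

print(x_tilde, grows_or_equal, shrinks, np.isclose(sq_dist, from_cosine))
```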

8.5. HOW THE FACTORIZED BILINEAR POOLING ALLEVIATES THE BURSTINESS

Because the bilinear features are linearly discriminant, their dimension can be reduced by a set of linear projections. In this way, the variance along the eliminated directions is discarded. Then, the L2-normalization strategy is adopted to reduce the variance along the radius direction in the low-dimensional feature space. In this way, the intra-class variances are reduced and the burstiness problem is alleviated.

9. RELATED WORK

We review works improving bilinear pooling in two aspects: 1) enhancing the effectiveness of the bilinear feature; 2) reducing the dimension of the bilinear feature for greater efficiency. Effectiveness improvement. There are several different routes to improve the effectiveness of the bilinear feature. For example, G2DeNet (Wang et al., 2017), FASON (Dai et al., 2017) and MoNet (Gou et al., 2018) take the first-order statistics into consideration in addition to the bilinear pooling. KP (Cui et al., 2017), HOK (Koniusz et al., 2016) and HOP (Cherian et al., 2017) extend the second-order pooling to higher-order pooling. By considering bilinear pooling as depicting correlations across different features in a specific kernel space, Ker-RP (Wang et al., 2015; Zhang et al., 2021) employs kernel functions instead of the inner product to strengthen the representative capability of the bilinear feature. Bilinear pooling can also be enhanced by capturing correlations between features yielded by different layers (Yu et al., 2018). Besides, normalization strategies have also proved effective in improving the discriminant ability of bilinear features. For instance, improved bilinear pooling (IBP) (Lin & Maji, 2017), MPN-COV (Li et al., 2018) and their variants (Koniusz et al., 2018) explore matrix normalization to moderate the singular values of the bilinear matrix. With those strategies, the discriminant ability of bilinear features is increased by a large margin compared with the original bilinear features. Dimension reduction. The bilinear feature is a high-dimensional matrix, which limits its efficiency and makes it prone to over-fitting. To speed up the classification and suppress over-fitting, FBN (Li et al., 2017b) reduces the dimension of the weight matrices in the classifier by replacing each weight matrix with a product of two low-rank matrices. In parallel, some methods focus on dimension reduction of the original first-order features before bilinear pooling.
BCNN (Lin, 2015) uses principal component analysis (PCA) to learn the projection matrix for reducing the dimension of first-order features. Specifically, it uses the projection matrix learned from PCA as the initialization of a 1 × 1 convolution layer and then trains the network in an end-to-end manner. PCA is also utilized in LRBP (Kong & Fowlkes, 2017) and iSQRT-COV (Li et al., 2018) for dimension reduction of the first-order features. Apart from those linear projections with learnable parameters, CBP (Gao et al., 2016; Yu et al., 2021) approximates the operation of bilinear pooling by the polynomial kernel function. CBP then employs two kernel approximation methods, tensor sketch and random Maclaurin (RM), to reduce the dimension of the bilinear feature, which achieve better performance than PCA. The strategy of CBP is also adopted in MLB (Fukui et al., 2016) for cross-modal understanding. In CBP, the projection is formulated to reduce the dimension of the first-order features by two random projections, whose outputs are then fused by the Hadamard product. Hadamard product-based low-rank factorized bilinear pooling (HFBP) (Kim et al., 2016) replaces the random matrices with two learnable matrices and obtains a new compact bilinear pooling algorithm. This technique is then adopted by HBP (Yu et al., 2018) for fusing features from different layers. Although many dimension reduction algorithms have been proposed in past years, a general perspective for understanding them is lacking. In this paper, we review the dimension reduction algorithms from the perspective of finding appropriate projection directions. Moreover, we reveal that the projections used in those algorithms tend to miss a lot of possible projecting directions, which reduces the effectiveness of the compact bilinear feature. As is well known, the polynomial kernel function can be adopted to solve the inseparable problem by support vector machines (SVMs) and k-means algorithms.
The ability to solve the inseparable problem also enhances the discriminant ability of the vectors x. Let us adopt spectral graph partitioning as a tool to analyze the performance of bilinear features. Before doing this, we introduce the procedure of spectral graph partitioning.

Definition 4. Given a set of samples $\{x_i\}_{i=1}^{N}$, the adjacency matrix $S$ is defined by

$$S_{ij} = \begin{cases} K(x_i, x_j), & i \neq j \\ 0, & i = j \end{cases} \quad (21)$$

where $K(x_i, x_j)$ is a kernel function whose value monotonically decreases with respect to the distance between $x_i$ and $x_j$, e.g., $K(x, y) = \exp(-\gamma\|x - y\|_2^2)$ $(\gamma > 0)$. The normalized Laplacian matrix is defined as $L = I - D^{-1/2} S D^{-1/2}$, where $D$ is a diagonal matrix with $D_{ii} = d_i = \sum_{j} S_{ij}$. The following equation holds:

$$f^T L f = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} S_{ij}\left(\frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}}\right)^2$$

where $f = [f_1, f_2, \cdots, f_N]^T \in \mathbb{R}^{N\times 1}$. For traditional bi-class spectral graph partitioning, the objective is to minimize the normalized cut, whose objective function is presented as follows:

$$\min_{f \in \{-1,1\}^{N\times 1}} \sum_{i=1}^{N}\sum_{j=1}^{N} S_{ij}\left|\frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}}\right|_2^2$$

Because the value of $K(x_i, x_j)$ is larger when the distance between $x_i$ and $x_j$ is smaller, the solution of the above optimization problem is

$$f_i = \begin{cases} \sqrt{d_i}, & x_i \in C_1 \\ -\sqrt{d_i}, & x_i \in C_2 \end{cases} \quad (23)$$

where $C_i$ is the collection of samples in the i-th class.

The above is the traditional graph partitioning model. In our graph partitioning model, we use a kernel function $k(x_i, x_j)$ whose value monotonically increases with respect to the distance between $x_i$ and $x_j$. Thus, the objective function of our spectral clustering can be formulated as follows:

$$\max_{f \in \{-1,1\}^{N\times 1}} \sum_{i=1}^{N}\sum_{j=1}^{N} S_{ij}\left|\frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}}\right|_2^2 \quad (24)$$

Because the task only has two classes, we can obtain the solution presented in Eq.(23) by maximizing the objective function.
However, the two graph partitioning algorithms presented in Eq.(23) and Eq.(24) behave quite differently when they are employed to solve multi-class tasks. To clearly show the difference, we suppose there are 4 classes of samples. The i-th class is denoted as $C_i$, and the samples from the same class are listed together. The kernel matrices $S$ of the two methods are graphically shown in Figure 7. Both methods consider $f^T L f$, which equals the sum of the values in the dark-colored boxes in Figure 7. Because the traditional model minimizes the value of $f^T L f$, it tends to find dark-colored boxes with smaller areas. Our model maximizes $f^T L f$, so it finds the dark-colored boxes having the largest area. In this way, the two classes obtained by the traditional graph partitioning model are $\{C_1\}$ and $\{C_2, C_3, C_4\}$, which indicates that it separates one class of samples from the rest. When we want to separate the 4 classes, we need 4 vectors $\{f_i\}_{i=1}^{4}$ to indicate the class membership of the samples. In our model, however, the obtained classes consist of the samples in $\{C_1, C_2\}$ and $\{C_3, C_4\}$, which is not consistent with the ground truth of the samples. But if we further separate $C_1$ from $C_2$, and $C_3$ from $C_4$, i.e., obtain another partition $\{C_1, C_3\}$ and $\{C_2, C_4\}$, we can partition the data set well. Such a partitioning result corresponds to $f_1 = [\sqrt{d_1}, \sqrt{d_2}, -\sqrt{d_3}, -\sqrt{d_4}]$ and $f_2 = [\sqrt{d_1}, -\sqrt{d_2}, \sqrt{d_3}, -\sqrt{d_4}]$. Obviously, the two-dimensional samples $[\sqrt{d_1}, \sqrt{d_1}]$, $[\sqrt{d_2}, -\sqrt{d_2}]$, $[-\sqrt{d_3}, \sqrt{d_3}]$, $[-\sqrt{d_4}, -\sqrt{d_4}]$ […]

$$\max_{F^T F = D_k} \sum_{k=1}^{C}\sum_{i=1}^{N}\sum_{j=1}^{N} S_{ij}\left|\frac{f_i^k}{\sqrt{d_i}} - \frac{f_j^k}{\sqrt{d_j}}\right|_2^2 \iff \max_{F^T F = D_k} \mathrm{Tr}\left(F^T (I - D^{-1/2} S D^{-1/2}) F\right) \iff \max_{F^T F = D_k} \mathrm{Tr}\left(F^T (-D^{-1/2} S D^{-1/2}) F\right) \iff \max_{F^T F = D_k} \left\|D^{-1/2} S D^{-1/2} - F F^T\right\|_F^2 \quad (25)$$

where $k = \log_2(C)$.

Remark 5. The solution $F$ of Eq.(25) is given by the $\log_2(C)$ smallest eigenvectors of $D^{-1/2} S D^{-1/2}$.

Lemma.
If the number of samples $N$ is large enough, the eigenvectors of $D^{-1/2} S D^{-1/2}$ are equal to the eigenvectors of $S$. Thus, the embedding $F$ can be obtained by performing eigenvalue decomposition on $S$.

Proof. The i-th diagonal element of $D^{-1/2}$ is $d_i = 1/\sqrt{\sum_{j=1}^{N} S_{ij}}$, where $S_{ij}$ is the value of $k(x_i, x_j)$, which monotonically increases with the distance between $x_i$ and $x_j$. If $N$ is large enough, the error $\|d_i - d_j\| < \epsilon$. Under this assumption, $d_1 \approx d_2 \approx d_3 \approx \cdots \approx d_N$. Therefore, we can let $D^{-1/2} S D^{-1/2} \approx (d_i)^2 S$. Thus, the embedding $F$ can be approximated by finding the $\log_2(C)$ smallest eigenvectors of $S$.

Theorem 3. When $X$ is the bilinear features, $S = X^T X$ becomes the similarity matrix determined by the second-order polynomial kernel function. Let $F$ denote the solution of Eq.(25); then there exists a linear projection $L$ such that $F = L^T X$.

Proof. Because $S = \sum_{i=1}^{r} \sigma_i u_i u_i^T$, where $\sigma_i$ is the i-th smallest eigenvalue of $S$, we have $\sum_{i=1}^{k} \sigma_i u_i u_i^T = F F^T$. Because $S = X^T X$, there is $X = \sum_{i=1}^{r} \sqrt{\sigma_i}\, u_i v_i^T$, and hence $\sum_{i=1}^{k} \sqrt{\sigma_i}\, v_i^T = [u_1, u_2, \cdots, u_k]^T X = F$.

Remark. The above theorem proves that the bilinear features can be reduced to an extremely low-dimensional space with their discriminant information being preserved. This is the theoretical basis for why our method can reduce the bilinear features to 512 dimensions.
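The claim that $\log_2(C)$ dimensions suffice to code $C$ classes can be illustrated with sign codes, ignoring the $\sqrt{d_i}$ scaling used in the text. The sketch below is an illustration only: the function `sign_codes` and its bit ordering are assumptions made here, chosen so that for four classes the two columns reproduce the sign patterns of $f_1$ and $f_2$.

```python
import numpy as np

def sign_codes(num_classes):
    """Assign each class a distinct vector in {+1, -1}^k, k = ceil(log2(C)).

    Bit value 0 is mapped to +1 and bit value 1 to -1, most significant
    bit first (an arbitrary convention for this illustration).
    """
    k = max(1, (num_classes - 1).bit_length())
    codes = [[1 if ((c >> b) & 1) == 0 else -1 for b in reversed(range(k))]
             for c in range(num_classes)]
    return np.array(codes)

codes4 = sign_codes(4)
# Column 0 splits {C1, C2} from {C3, C4}; column 1 splits {C1, C3} from {C2, C4}.
f1_signs = codes4[:, 0]
f2_signs = codes4[:, 1]
print(codes4)

# A one-vs-rest indicator coding would need C dimensions instead.
codes200 = sign_codes(200)
print(codes200.shape, len({tuple(row) for row in codes200}))
```

With 200 classes, 8 sign dimensions already give every class a distinct code, whereas an indicator-style coding would use 200 dimensions.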

10. RELATION WITH EXISTING METHODS

10.1. PCA BASELINE

The PCA baseline is exploited in BCNN and iSQRT-COV for dimension reduction of bilinear features. It adopts a 1 × 1 convolutional layer to reduce the dimension of the local feature from $m$ to $l_1$ $(l_1 < m)$. Then bilinear pooling is performed on the $l_1$-dimensional features to obtain an $l_1^2$-dimensional feature. To be specific, given $N$ local features $X = [x_1, x_2, \cdots, x_N] \in \mathbb{R}^{m\times N}$, the PCA baseline generates the compact local features by $Y = U^T X$. Thus, the compact bilinear feature is obtained by

$$Z = U^T X X^T U \quad (26)$$

According to Theorem 1, Eq.(26) actually employs a set of rank-1 matrix bases to reduce the high-dimensional bilinear feature $XX^T$ to a compact one $Z$. According to our analysis, if the data prefers matrix bases of larger rank, some discriminant information is lost in the dimension reduction procedure. To overcome this shortcoming, we should employ $k$ projecting matrices $\{U_t\}_{t=1}^{k}$ to reduce the local feature $X$ and calculate the compact bilinear feature $Z$ as follows:

$$Z = \sum_{t=1}^{k} U_t^T X X^T U_t \quad (27)$$

10.2. RANDOM MACLAURIN (RM)

RM is employed in CBP (Gao et al., 2016) for compact bilinear pooling. Inspired by the success of CBP, many RM-based algorithms have been proposed (Yu et al., 2021; Fukui et al., 2016). RM employs two random projecting matrices $U$ and $V$ to reduce the dimension of bilinear features:

$$z = \sum_{i=1}^{N} U^T x_i \circ V^T x_i \quad (28)$$

where $z$ is the compact bilinear feature and $\circ$ is the Hadamard product. Let us compare Eq.(28) with Eq.(4). If we set $P$ as an identity matrix, Eq.(28) and Eq.(4) are the same. The difference is that $U$ and $V$ in Eq.(4) are updated by gradient descent algorithms, while those in Eq.(28) are random variables sampled from random distributions. This may be why methods based on Eq.(4) outperform methods based on RM in terms of classification accuracy. However, because the parameters in Eq.(28) do not need to be updated via gradient descent, RM involves less computation.
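The two projections discussed in Sections 10.1 and 10.2 can be compared on random data. The sketch below rests on stated assumptions: the sizes and the Gaussian matrices are arbitrary, and the Hadamard form uses two matrices `U` and `V` as described in the text. It checks that pooling after projection equals projecting the pooled matrix, and that each entry of the Hadamard-product feature is the projection of $XX^T$ onto one rank-1 base $u_r v_r^T$, i.e., only the diagonal of $U^T X X^T V$ is kept.

```python
import numpy as np

rng = np.random.default_rng(0)
m, l, N = 6, 4, 10
X = rng.standard_normal((m, N))    # N local features as columns
U = rng.standard_normal((m, l))    # projection matrices (random here)
V = rng.standard_normal((m, l))

# PCA-baseline form: project the local features, then pool ...
Y = U.T @ X
Z_project_then_pool = Y @ Y.T
# ... which equals projecting the pooled bilinear matrix from both sides.
Z_pool_then_project = U.T @ (X @ X.T) @ U

# Hadamard-product form: z = sum_i (U^T x_i) * (V^T x_i), elementwise.
z_hadamard = ((U.T @ X) * (V.T @ X)).sum(axis=1)

# Entry r is the inner product <X X^T, u_r v_r^T> = u_r^T X X^T v_r.
z_rank1 = np.array([U[:, r] @ (X @ X.T) @ V[:, r] for r in range(l)])
z_diag = np.diag(U.T @ X @ X.T @ V)

print(np.allclose(Z_project_then_pool, Z_pool_then_project),
      np.allclose(z_hadamard, z_rank1),
      np.allclose(z_hadamard, z_diag))
```

The off-diagonal entries of $U^T X X^T V$, which pair $u_p$ with $v_q$ for $p \neq q$, never appear in the Hadamard-product feature; this is one way to see the missed projecting directions.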

10.3. TWO-LEVEL KRONECKER PRODUCT FACTORIZATION (TKPF)

TKPF supposes that a projecting matrix $P$ can be decomposed as a sum of Kronecker products:

$$P = \sum_{q=1}^{Q} A^{(q)} \otimes B^{(q)} \quad (29)$$

Then, by further decomposing $A^{(q)}$ and $B^{(q)}$ as $A^{(q)} = I_r \otimes \hat{A}^{(q)}$ and $B^{(q)} = I_r \otimes \hat{B}^{(q)}$, TKPF formulates the compact bilinear pooling as

$$Z = \sum_{q=1}^{Q} \left(I_r \otimes (\hat{B}^{(q)})^T\right) X X^T \left(I_r \otimes \hat{A}^{(q)}\right)$$

where $\hat{A}^{(q)}$ and $\hat{B}^{(q)}$ are learnable parameters. Because the scale of $\hat{A}^{(q)}$ and $\hat{B}^{(q)}$ can be adjusted by the parameter $r$, TKPF can use very few parameters to reduce the dimension of the bilinear feature. However, TKPF has the following shortcoming.

where $U = [U_1, \cdots, U_h] \in \mathbb{R}^{m\times hk}$ and $V = [V_1, \cdots, V_h] \in \mathbb{R}^{n\times hk}$ are the learnable parameters of the dictionary. $P \in \mathbb{R}^{h\times hk}$ is a fixed binary matrix in which only the elements in row $l$, columns $((l-1)\times h) + 1$ to $lh$, are "1", where $l \in [1, h]$. Because the above formulation involves the matrix inverse operation, it adopts the relaxation strategy

$$\left(U^T U P^T \circ V^T V P^T\right)^{-1} P\left(U^T x_s \circ V^T y_t\right) = \left(\hat{U}^T x_s \circ \hat{V}^T y_t\right),$$

$$f' = P\left(\hat{U}^T x_s \circ \hat{V}^T y_t\right), \qquad f = \mathrm{sign}(f'_i) \circ \max\left(\mathrm{abs}(f'_i) - \frac{\lambda}{2},\, 0\right).$$

However, we can prove that the coding model is mathematically equivalent to the traditional model, and we can propose a new coding-based model that does not need relaxation. We consider the vector-based coding model $\min_y \|x - W^T y\|_F^2$, where $y$ is the coefficient of the vector $x$ on the dictionary $W$. Its solution is $y = (WW^T)^{-1} W x$. According to FBC, when we replace $x$ with $\mathrm{vec}(x_s y_t^T)$, the i-th column of $W$ with $V_i U_i^T$, and $y$ with $f$, we obtain the formulation in Eq.(35). Similarly, we can construct another coding model $\min_y \|x - W^T (WW^T)^{-1} y\|_F^2$, where $y$ is the coefficient of $x$ on the atoms $W(W^T W)^{-1}$. Its solution is $y = W^T x$, which is a linear projection. Thus, according to Eq.(36), if we replace the i-th column $w_i$ of $W$ with $V_i U_i^T$, the matrix-based dictionary is $P\left(U^T U P^T \circ V^T V P^T\right)^{-1} P R^T$. The i-th column of $R$ is $u_i \otimes v_i$, where $u_i$ and $v_i$ are the i-th columns of $U$ and $V$.
This is because $R^T \mathrm{vec}(x_s y_t^T) = U^T x_s \circ V^T y_t$. Because $u_i \otimes v_i = \mathrm{vec}(u_i v_i^T)$, $P\left(U^T U P^T \circ V^T V P^T\right)^{-1} P R^T$ can also be transformed into a set of matrices. Thus, we have another type of matrix-based coding model, and the bilinear feature can be output by $f = P^T (U^T x_s \circ V^T y_t)$, which is equivalent to the formulation in Eq.(37). Moreover, this solution is an exact one. This is why we directly derive the bilinear pooling model from our general bilinear projection $f = P^T \mathrm{vec}(U^T x_s y_t^T V)$ rather than from the coding-based framework.
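The closed-form solution of the vector-based coding model $\min_y \|x - W^T y\|^2$ and the soft-thresholding step can be sketched as follows. This is an illustration under assumptions: `W` is a random dictionary with atoms stored as rows and more input dimensions than atoms (so that $WW^T$ is invertible), the function `soft_threshold` mirrors the $\mathrm{sign}(\cdot)\circ\max(\mathrm{abs}(\cdot) - \lambda/2, 0)$ expression, and none of this reproduces the FBC implementation itself.

```python
import numpy as np

def soft_threshold(f, lam):
    """sign(f) * max(|f| - lam / 2, 0), applied elementwise."""
    f = np.asarray(f, dtype=float)
    return np.sign(f) * np.maximum(np.abs(f) - lam / 2.0, 0.0)

rng = np.random.default_rng(0)
h, d = 4, 12                       # h atoms, d-dimensional inputs (h < d)
W = rng.standard_normal((h, d))    # dictionary, one atom per row
x = rng.standard_normal(d)

# Closed form of min_y ||x - W^T y||^2, i.e. y = (W W^T)^{-1} W x.
y_closed = np.linalg.solve(W @ W.T, W @ x)

# Reference solution from a generic least-squares solver.
y_lstsq, *_ = np.linalg.lstsq(W.T, x, rcond=None)

# Sparse code obtained by shrinking the coefficients toward zero.
y_sparse = soft_threshold(y_closed, lam=0.5)

print(np.allclose(y_closed, y_lstsq))
print(soft_threshold([3.0, -0.2, 0.5], lam=1.0))
```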

10.5. FORMULATION OF RK-HFBP

Eq.(4) has a matrix $P \in \mathbb{R}^{l\times h}$, so $l$ cannot be very large. Thus, Eq.(4) is a rank-1 Hadamard product-based FBiP. As discussed in FBC, the rank of the projecting matrices is also very important, so we improve Eq.(4) by giving its projecting matrices a higher rank. By employing the same strategy as our proposed rank-k bilinear projection $Z = U^T (I_k \otimes X) V = \sum_{i=1}^{k} U_i^T X V_i$, we obtain a new rank-k Hadamard product-based bilinear pooling (RK-HFBiP), presented as follows:

$$y = P^T \mathrm{vec}\left(\sum_{i=1}^{k} U_i^T x_s \circ V_i^T y_t\right)$$

We will use RK-HFBiP as a comparison method in our ablation experiments.

Number of orders. We can nest several RK-FBP modules together to learn compact representations with high-order information, which we denote as RK-FBP-M. We evaluate the influence of the order T on RK-FBP-M. We set $l_1 = l_2 = h = 512$, and vary d among {2, 3, 4}. Because our model allows us to concatenate features of different orders, we have 7 types of features. As seen from Table 3, the multi-linear features outperform the bilinear features by a significant margin. Besides, the result does not always improve as the order increases. To be specific, the (2, 3), (2, 4), and (2, 3, 4) features achieve similar accuracy. Thus, in our paper, we set the order parameter as (2, 3) for our multi-linear model.
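The identity behind the rank-k bilinear projection, $U^T (I_k \otimes X) V = \sum_{i=1}^{k} U_i^T X V_i$, can be checked numerically. The sketch below assumes that $U$ and $V$ are formed by stacking the blocks $U_i$ and $V_i$ vertically, which is the layout under which the Kronecker form is dimensionally consistent; this layout, the sizes, and the random matrices are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, l1, l2, k, h = 5, 4, 3, 2, 3, 6
X = rng.standard_normal((m, n))

# k pairs of projection blocks U_i (m x l1) and V_i (n x l2).
U_blocks = [rng.standard_normal((m, l1)) for _ in range(k)]
V_blocks = [rng.standard_normal((n, l2)) for _ in range(k)]

# Assumed layout: blocks stacked vertically, U is (k*m x l1), V is (k*n x l2).
U = np.vstack(U_blocks)
V = np.vstack(V_blocks)

# Kronecker form: I_k (x) X is block diagonal with k copies of X.
Z_kron = U.T @ np.kron(np.eye(k), X) @ V

# Equivalent sum of k ordinary bilinear projections.
Z_sum = sum(Ui.T @ X @ Vi for Ui, Vi in zip(U_blocks, V_blocks))

# Final compact feature y = P^T vec(Z) with a (here random) matrix P.
P = rng.standard_normal((l1 * l2, h))
y = P.T @ Z_sum.ravel()

print(np.allclose(Z_kron, Z_sum), y.shape)
```

Each bilinear projection contributes bases of the rank-1 form $u v^T$; summing $k$ of them is what allows the effective matrix bases to reach rank $k$.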

11. ABLATION EXPERIMENTS

Normalization Strategy. In the full bilinear pooling approaches, the performance is crucially dependent on the normalization strategy. Thus, we explore how the normalization strategies affect our proposed RK-FBP. We add the two normalization strategies adopted by full bilinear pooling (Lin, 2015) and improved bilinear pooling (Lin & Maji, 2017) before the RK-FBP modules. We denote them by 'SgnSqrt' normalization and 'SgnSqrt+log' normalization, respectively. We also compare the result with only 'log' normalization. At last, all bilinear features are normalized to a unit vector by L2-normalization. The results are shown in Table 4. As seen from the results, the accuracies are similar, with little variance. This is a bit different from Table 1, which employs normalization strategies after projection. In Table 1, the combination of feature reduction and normalization strategies increases the performance. This may be because the dimension of the features after projection is still high, so much of the information about large intra-class variance remains. So the normalization strategies can improve the performance of models by removing this harmful information. Because our RK-FBP does not use 'SgnSqrt' and 'log' but achieves results comparable with those methods using normalization strategies, it indicates that our proposed model can solve the burstiness problem caused by large intra-class variances.



Figure 1: (a): Original data {xi}; (b): The results projected by xi = [0.55, -0.45; -0.45, 0.55]xi; (c): L2-normalization on xi: zi = xi/|xi|2; (d): Element-wise "signed-square-root": xi = sgn(xi)√|xi|. Samples in regions 1, 2, 3, 4 shrink to points [1, 1], [-1, 1], [-1, -1] and [1, -1], respectively. (e)(f): L2-normalization on xi and xi, respectively. Comparing (e) with (f), we know the intra-class variances in regions 2 and 4 are reduced but those in 1 and 3 are increased in (e). Thus, an accurate linear projection is better than the "signed-square-root" strategy for solving the burstiness problem.

Figure 2: Two ways to connect dimensions of U T xs and V T y t . (a) is Hadamard product; (b) is Kronecker product.

Figure 3: The angles between columns of U (2048 × 2048) trained by Eq.(4). Because the angle range is about [85°, 95°], the columns are nearly orthogonal to each other.

$\sum_{p=1}^{m} a_{rp} u_p$ and $v_r = \sum_{q=1}^{m} b_{rq} v_q$, respectively, where $a_{rp}$ and $b_{rq}$ are auxiliary parameters introduced for easy reading. Obviously, $u_r$ and $v_r$ can generate a projecting direction $\mathrm{vec}(u_r v_r^T) = \sum_{p}\sum_{q} a_{rp} b_{rq}\, \mathrm{vec}(u_p v_q^T)$. Let us list those generated projecting vectors of different $r$ into a projecting matrix, i.e., L = [vec(

DECOMPOSITION OF MATRIX BASES. Given a set of rank-k matrices $\mathcal{W}_k^{m\times n} = \{W_p \in \mathbb{R}^{m\times n}\}_{p=1}^{mn}$, if those matrices are linearly independent, then $\mathcal{W}_k^{m\times n}$ is a rank-k complete base set of the matrix space $\mathbb{R}^{m\times n}$.

Figure 4: (a) The proposed multi-linear pooling structure. (b) RK-FBP module. In (a), we use three Rk-FBP

Under review as a conference paper at ICLR 2023

5.2. ABLATION EXPERIMENT

We adopt VGG16 as the backbone for the ablation experiments. Rank k. Different values of k mean different matrix bases, so the compact bilinear features also have different performance. To demonstrate how the rank k affects RK-FBP, we perform RK-FBP on the Indoor data set with varied k ∈ [1, • • • , 15]. The numbers of column and row projections are set as l1 = l2 = l ∈ {50, 100, 150, 200, 250}, and the dimension of the bilinear feature is set to h = 512. The results are depicted in Figure.

Figure 5: Accuracy of RK-FBP with different Rank k on Indoor (dimension h = 512). When l is small, k should be large to achieve its largest accuracy.

Figure 6: Accuracy of RK-FBP with different dimensions on Indoor. FBC and RK-HFBP need large dimensions to achieve the best result.

INTERPRETATION OF BILINEAR FEATURES FROM THE PERSPECTIVE OF SUPERVISED SPECTRAL GRAPH PARTITIONING

9.1.1. SPECTRAL GRAPH PARTITIONING

Let us consider the similarity between two samples x and y with the polynomial kernel function $K(x, y) = (\gamma(x^T y) + d)^2$, where $\gamma$ and $d$ are two hyper-parameters adjusted for different data distributions. If we calculate the inner product between two bilinear features $\mathrm{vec}(xx^T)$ and $\mathrm{vec}(yy^T)$: $\langle xx^T, yy^T\rangle = (x^T y)^2$, we can find that it is the special case of the similarity calculated by the polynomial kernel function with $\gamma = 1$ and $d = 0$. The similarity matrix S on a data set can be used to describe the discriminant information between

Figure 7: (a) Traditional spectral graph partitioning. (b) Our designed spectral graph partitioning. There are four classes {Ci} (i = 1, ..., 4). When we use the bi-class based model to deal with the four classes, the partitions produced by the two models are different. In (a), the first class consists of C1, and the second class consists of {C2, C3, C4}; in (b), the first class consists of {C1, C2}, and the second class consists of {C3, C4}. When we extend the bi-class model to a multi-class version, the strategy is different.

Figure 8: (a) Traditional spectral graph partitioning. (b) Our designed spectral graph partitioning. The traditional graph partitioning method needs $C$ dimensions to encode $C$ classes of samples, while our method needs only $\log_2(C)$ dimensions.

Remark 5. Our proposed graph partitioning algorithm needs only $\log_2(C)$ dimensions to represent the embeddings of samples. Theoretically, 64 dimensions suffice to encode $2^{64}$ classes of samples. Geometric illustrations of the samples obtained by the two types of graph partitioning methods are shown in Figure 8.

Now, we employ our proposed graph partitioning algorithm to extract the features. Our multi-class graph partitioning model is formulated as

Accuracy (%) with different rank k of the projection on different datasets.

Comparisons for BiP methods in terms of Average Precision (%)

Tan Yu, Xiaoyun Li, and Ping Li. Fast and compact bilinear pooling by shifted random Maclaurin. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 3243-3251, 2021.

Accuracy (%) with different orders on various datasets. $(a, \dots, c)$ means the feature is constructed by concatenating the a-th, ..., c-th order statistical information.

Accuracy (%) with different normalization strategies on various datasets.

annex

TKPF assumes that every matrix P can be decomposed as Eq. (29). However, this assumption is not true. If we construct the i-th column of P as

Comparing Eq. (31) with Eq. (32), we know cannot be decomposed as Eq. (29). Besides, is also a Kronecker product-based decomposition, which may not hold. This means TKPF ignores many feasible projecting directions. If the data prefers those missed directions, the performance of the compact features is poor.

Let us compare TKPF with our bilinear model. For easy comparison, we transform our bilinear projection $y = P^T \mathrm{vec}\left(\sum_{i=1}^{k} U_i^T X V_i\right)$ into a vector-based form presented as follows.

Then, the projecting matrix in Eq. (33) is

Obviously, Eq. (29) and Eq. (34) look similar. The difference between them is that Eq. (33) has a matrix L while Eq. (29) does not. L selects columns of $V_q^T \otimes U_q^T$ to form a matrix that cannot be decomposed by a Kronecker product. This allows our P to be any matrix. Therefore, this small difference makes our proposed projection an accurate one, while the projection in Eq. (29) is not.

Our proposed bilinear projection is general: it can be used to analyze the performance of other dimension reduction algorithms for matrix data, such as two-dimensional principal component analysis (Zhang & Ren, 2011) and two-dimensional linear discriminant analysis (Ye et al., 2004).

As discussed above, TKPF employs some inappropriate matrix decompositions to construct the projection, which means TKPF also cannot find accurate projecting matrices. Although the performance of TKPF looks good on its reported datasets, TKPF may suffer a large performance drop in other applications. Thus, the application range of TKPF is limited. Finally, compared with our proposed bilinear model, the dimension of its compact bilinear feature is still high, e.g., its best accuracy is obtained with a dimension of 96 × 96.

where $\lambda$ is a trade-off between the reconstruction error and the sparsity. $U_l \in \mathbb{R}^{m \times k}$ and $V_l \in \mathbb{R}^{n \times k}$ are two rank-k matrices decomposed from the l-th rank-k matrix atom. Here, $U_l$ and $V_l$ are learned by the deep model over the whole set of original samples. The optimization problem in Eq. (35) has a closed-form solution presented as follows:

