COMPACT BILINEAR POOLING VIA GENERAL BILINEAR PROJECTION

Abstract

Many factorized bilinear pooling (FBiP) algorithms employ a Hadamard product-based bilinear projection to learn appropriate projecting directions for reducing the dimension of bilinear features. However, in this paper, we reveal that the Hadamard product-based bilinear projection makes FBiP miss many feasible projecting directions, which significantly harms both the compactness and the effectiveness of the output compact bilinear features. To address this issue, we propose a general matrix-based bilinear projection based on rank-k matrix base decomposition, of which the Hadamard product-based bilinear projection and Y = U^T X V are special cases. Our proposed projection can therefore be used to improve algorithms built on either type of bilinear projection. Using the proposed bilinear projection, we design a novel low-rank factorized bilinear pooling (named RK-FBP), which considers the feasible projecting directions missed by the Hadamard product-based bilinear projection; hence RK-FBP can generate better compact bilinear features. To leverage high-order information in local features, we nest several RK-FBP modules together to formulate a multi-linear pooling that outputs compact multi-linear features. Finally, we conduct experiments on several fine-grained image tasks to evaluate our models, which show that our models achieve new state-of-the-art classification accuracy with the lowest feature dimension.

1. INTRODUCTION

Bilinear pooling (BiP) (Lin, 2015) and its variants (Li et al., 2017b; Lin & Maji, 2017; Wang et al., 2017) employ the Kronecker product to yield expressive representations by mining the rich statistical information in a set of local features, and have attracted wide attention in many applications, such as fine-grained image classification and visual question answering. Despite their excellent performance, bilinear features suffer from two shortcomings: (1) the ability of BiP to boost the discriminant information between different classes also magnifies the intra-class variances of the representations, so BiP easily encounters the burstiness problem (Gao et al., 2020; Zheng et al., 2019) and suffers a performance deficit; (2) the Kronecker product exploited by BiP usually makes the bilinear features exceptionally high-dimensional, leading to overfitted training of the succeeding tasks and a hefty computational and memory load. Thus, effectively solving the shortcomings of BiP is an important issue.

Several approaches have been proposed (Gao et al., 2016; Fukui et al., 2016; Yu et al., 2021; Li et al., 2017b; Kim et al., 2016) to address these shortcomings. Among them, factorized bilinear pooling (FBiP) methods (Li et al., 2017b; Kim et al., 2016; Amin et al., 2020; Yu et al., 2018; Gao et al., 2020) have been promising leads. In essence, FBiP performs a dimension-reduction operation on bilinear features: it finds a linear projection that maps bilinear features into a low-dimension space while preserving their discriminant information among classes with the fewest dimensions, and then employs L2-normalization to project those low-dimension features onto a hyper-sphere. (Bilinear pooling is equivalent to the non-linear projection determined by the polynomial kernel function k(x, y) = (⟨x, y⟩)^2 (Gao et al., 2016), which makes bilinear features likely to be linearly discriminant.)
Thus, the information reflecting the large intra-class variances is abandoned because it does not help distinguish different classes. In this way, the burstiness and high-dimension problems are solved simultaneously (Gao et al., 2020; Wei et al., 2018). This procedure is depicted in Figure 1. Sub-figure (a) shows a set of samples with a large variance. Sub-figure (c) is the result after the linear projection and L2-normalization; the intra-class variance is reduced significantly. In fact, the information along e1 can be discarded completely, so the dimension of the data becomes 1. Comparing sub-figure (c) with (e), we find that FBiP outperforms the combination of the signed-square-root transformation and L2-normalization. To achieve such performance, the key step is to find the accurate projecting direction e2; otherwise, the dimension and intra-class-variance reduction cannot succeed. Because of the extremely high dimension of bilinear features, it is not easy to find appropriate projecting directions with a traditional linear projection. Thus, finding a parameter-efficient model that accurately depicts those appropriate directions is crucial for solving the shortcomings of bilinear features.
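As a concrete illustration of this pipeline, the sketch below (our own numpy example, not the paper's implementation; the shapes and the random projection P are arbitrary choices for illustration) forms the bilinear feature of a local descriptor, checks the polynomial-kernel identity ⟨vec(xxᵀ), vec(yyᵀ)⟩ = (⟨x, y⟩)², and then applies a linear projection followed by L2-normalization onto the unit hyper-sphere:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(8), rng.standard_normal(8)

# Bilinear (outer-product) features of two local descriptors, flattened.
bx = np.outer(x, x).ravel()          # 64-dimensional
by = np.outer(y, y).ravel()

# Polynomial-kernel equivalence: <vec(x x^T), vec(y y^T)> = (<x, y>)^2.
assert np.isclose(bx @ by, (x @ y) ** 2)

# Dimension reduction of the 64-d bilinear feature to 4-d with a linear
# projection P, followed by L2-normalization onto the unit hyper-sphere.
P = rng.standard_normal((64, 4))
f = P.T @ bx
f = f / np.linalg.norm(f)            # ||f||_2 = 1
```

The whole difficulty the paper addresses is choosing P well: a random P as above reduces the dimension but does not preserve discriminant directions.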
Most FBiP approaches formulate the projection for dimension reduction as a Hadamard product-based bilinear projection (Kim et al., 2016; Gao et al., 2020; Li et al., 2017b):

f = P^T (U^T x_s ∘ V^T y_t),

where U and V are two learnable variables, and P is a variable defined differently across algorithms. Each dimension of f can be seen as a linear combination of the dimensions of the Hadamard product z = U^T x_s ∘ V^T y_t. Consider each dimension of z shown in Figure 2. The Hadamard product only considers the values {(u_i^T x_s)(v_j^T y_t) | i = j} and ignores the values {(u_i^T x_s)(v_j^T y_t) | i ≠ j}, which are captured by the Kronecker product. In this paper, we prove that those ignored values are important for dimension reduction, because they are the coefficients of bilinear features on the feasible matrix projecting directions {v_i u_j^T | i ≠ j}. Missing them leads to inaccurate projecting directions for the dimension reduction, which greatly undermines the remedy for the shortcomings of BiP. Consequently, the effectiveness and compactness of FBiP are seriously harmed. In this paper, from the perspective of finding accurate and parameter-efficient matrix projecting directions, we analyze the decomposition of the bases of a matrix space and then propose a general bilinear projection based on decomposed rank-k matrix bases.
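The gap between the two products can be checked numerically. In this small numpy sketch (our own illustration; the dimensions are arbitrary), the Hadamard output keeps exactly the diagonal of the c × c grid of values produced by the Kronecker product, so the c(c − 1) cross terms with i ≠ j never reach a Hadamard-based FBiP:

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 6, 5                        # input dim, projected dim
x_s, y_t = rng.standard_normal(d), rng.standard_normal(d)
U, V = rng.standard_normal((d, c)), rng.standard_normal((d, c))

a, b = U.T @ x_s, V.T @ y_t        # projected features, each c-dimensional

z_had = a * b                      # Hadamard: c values (u_i^T x_s)(v_i^T y_t)
z_kro = np.kron(a, b)              # Kronecker: all c*c values (u_i^T x_s)(v_j^T y_t)

# The Hadamard result is the diagonal of the c x c Kronecker grid;
# the remaining c*(c-1) cross terms (i != j) are discarded.
assert np.allclose(z_had, z_kro.reshape(c, c).diagonal())
```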
Because of the solid mathematical foundation of the proposed bilinear projection, it can serve as a baseline for analyzing current FBiP. Employing our novel bilinear projection, we formulate a new FBiP model that misses no feasible projecting directions. The contributions are listed as follows: (1) We present a detailed analysis demonstrating why traditional FBiP tends to miss many feasible projecting directions. Based on this analysis, we propose a general bilinear projection that calculates the coefficients of matrix data on a complete set of decomposed rank-k matrix bases. (2) Based on the proposed general bilinear projection, we design a new FBiP method named rank-k factorized bilinear pooling (RK-FBP). Because of its capability to learn accurate projecting directions, the calculated bilinear features are highly compact and effective. Utilizing this property, we nest several RK-FBP modules to calculate multi-linear features that remain compact and effective. (3) We conduct experiments on several challenging image classification datasets to demonstrate the effectiveness of the proposed RK-FBP. Compared with state-of-the-art BiP methods, our model outputs extremely compact bilinear features (of dimension 512) and achieves comparable or better classification accuracy.
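To see why rank-k matrix bases are parameter-efficient projecting directions, the following numpy sketch (our own illustration of the underlying identity; RK-FBP's actual parameterization may differ) shows that the coefficient of a bilinear feature x_s y_t^T on a rank-k direction W = Σ_r a_r b_r^T equals tr(W^T x_s y_t^T) = Σ_r (a_r^T x_s)(b_r^T y_t), i.e., it is computable from k pairs of vector projections without ever forming the d × d bilinear feature:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 6, 3
x_s, y_t = rng.standard_normal(d), rng.standard_normal(d)
A = rng.standard_normal((d, k))    # columns a_r
B = rng.standard_normal((d, k))    # columns b_r

W = A @ B.T                        # rank-k projecting direction, sum_r a_r b_r^T
X = np.outer(x_s, y_t)             # bilinear feature x_s y_t^T

coef_full = np.trace(W.T @ X)                    # via the full d x d matrices
coef_fact = np.sum((A.T @ x_s) * (B.T @ y_t))    # via k vector projections only

assert np.isclose(coef_full, coef_fact)
```

The factorized form costs O(dk) per direction instead of O(d²), which is what makes learning many accurate matrix directions tractable.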



Figure 1: (a): Original data {x_i}; (b): The results projected by x̃_i = [0.55, −0.45; −0.45, 0.55] x_i; (c): L2-normalization of x̃_i: z_i = x̃_i / ||x̃_i||_2; (d): Element-wise signed square root: x̂_i = sgn(x_i) √|x_i|. Samples in regions 1, 2, 3, 4 shrink toward the points [1, 1], [−1, 1], [−1, −1], and [1, −1], respectively. (e)(f): L2-normalization of x̂_i and x_i, respectively. Comparing (e) with (f), the intra-class variances in regions 2 and 4 are reduced but those in 1 and 3 are increased in (e). Thus, an accurate linear projection is better than the signed-square-root strategy for solving the burstiness problem.

Figure 2: Two ways to connect the dimensions of U^T x_s and V^T y_t: (a) Hadamard product; (b) Kronecker product.

