ATTENTIONAL CONSTELLATION NETS FOR FEW-SHOT LEARNING

Abstract

The success of deep convolutional neural networks builds on the learning of effective convolution operations, capturing a hierarchy of structured features via filtering, activation, and pooling. However, explicit structured features, e.g. object parts, are not expressively represented in existing CNN frameworks. In this paper, we tackle the few-shot learning problem and make an effort to enhance structured features by expanding CNNs with a constellation model, which performs cell feature clustering and encoding with a dense part representation; the relationships among the cell features are further modeled by an attention mechanism. With the additional constellation branch increasing the awareness of object parts, our method is able to retain the advantages of CNNs while making the overall internal representations more robust in the few-shot learning setting. Our approach attains a significant improvement over existing methods in few-shot learning on the CIFAR-FS, FC100, and mini-ImageNet benchmarks.

* indicates equal contribution

1. INTRODUCTION

Tremendous progress has been made in both the development and the applications of deep convolutional neural networks (CNNs) (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; Szegedy et al., 2015; He et al., 2016; Xie et al., 2017). Visualization of the internal structure of CNNs trained on e.g. ImageNet (Deng et al., 2009) has revealed an increasing level of semantic relevance of the learned convolution kernels/filters to the object classes, displaying bar/edge-like patterns in the early layers, object parts in the middle layers, and face/object-like patterns in the higher layers (Zeiler & Fergus, 2014). In general, we consider the learned convolution kernels to be somewhat implicit about the underlying objects, since they represent projections/mappings of the input without explicit knowledge of the parts in terms of their numbers, distributions, and spatial configurations. On the other hand, there is a rich history of explicit object representations, starting from deformable templates (Yuille et al., 1992), pictorial structures (Felzenszwalb & Huttenlocher, 2005), constellation models (Weber et al., 2000; Fergus et al., 2003; Sudderth et al., 2005; Fei-Fei et al., 2006), and grammar-based models (Zhu & Mumford, 2007). These part-based models (Weber et al., 2000; Felzenszwalb & Huttenlocher, 2005; Fergus et al., 2003; Sudderth et al., 2005; Zhu & Mumford, 2007) share three common properties in their algorithm design: (1) unsupervised learning, (2) explicit clustering to obtain the parts, and (3) modeling to characterize the spatial configuration of the parts. Compared to CNN architectures, these methods are expressive with explicit part-based representations. They have pointed to a promising direction for object recognition, albeit without strong practical performance on modern datasets.
Another line of object recognition systems with the part concept, but trained discriminatively, includes the discriminatively trained part-based model (DPM) (Felzenszwalb et al., 2009) and the spatial pyramid matching method (SPM) (Lazebnik et al., 2006). In the context of deep learning, efforts exist to bring explicit part representations into deep hierarchical structures (Salakhutdinov et al., 2012). The implicit and explicit feature representations could share mutual benefits, especially in few-shot learning where training data is scarce: CNNs may face difficulty in learning a generalized representation due to the lack of sufficient training data, whereas clustering and dictionary learning provide a direct means for data abstraction. In general, end-to-end learning of both implicit and explicit part-based representations is a viable and valuable direction in machine learning. We view convolutional features as an implicit part-based representation since they are learned through back-propagation via filtering processes. On the other hand, an explicit representation can be attained by introducing feature clustering that captures the data abstraction/distribution under a mixture model. In this paper, we develop an end-to-end framework that combines the implicit and explicit part-based representations for the few-shot classification task by seamlessly integrating constellation models with convolution operations. In addition to keeping a standard CNN architecture, we employ a cell feature clustering module to encode the potential object parts. This procedure is similar to the clustering/codebook learning for appearance in the constellation model (Weber et al., 2000). The cell feature clustering process generates a dense distance map. We further model the relations among the cells using a self-attention mechanism, resembling the spatial configuration design in the constellation model (Weber et al., 2000). Thus, we name our method constellation networks (ConstellationNet).
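As a minimal illustration of the two ingredients above, the numpy sketch below shows (1) soft clustering of spatial cell features against a set of cluster centers, producing a dense (soft) distance map, and (2) a single-head self-attention over the cells to model their relations. This is a simplified sketch: the function names, the fixed cluster centers, and the inverse temperature `beta` are illustrative assumptions, while the actual model learns the centers end-to-end and uses attention with learned projections.

```python
import numpy as np

def cell_feature_clustering(feats, centers, beta=10.0):
    """Soft-assign each spatial cell feature to K cluster centers ("parts").

    feats:   (N, C) cell features flattened from a conv feature map
    centers: (K, C) cluster centers (learned in the real model; fixed here)
    Returns an (N, K) dense map: soft assignment of every cell to every part.
    """
    # Squared Euclidean distance between every cell and every center
    d2 = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
    # Soft assignment: numerically stable softmax over negative distances
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)
    s = np.exp(logits)
    return s / s.sum(axis=1, keepdims=True)

def self_attention(x):
    """Single-head scaled dot-product self-attention over the cells
    (no learned projections), modeling pairwise cell relations."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)            # (N, N) pairwise relations
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)  # row-wise softmax
    return attn @ x                          # (N, d) relation-aware features
```

Each row of the distance map is a distribution over parts, so a cell sitting near a cluster center (a part prototype) gets a peaked assignment, while the attention step lets distant cells exchange information about their part assignments.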
We demonstrate the effectiveness of our approach on standard few-shot benchmarks, including FC100 (Oreshkin et al., 2018), CIFAR-FS (Bertinetto et al., 2018), and mini-ImageNet (Vinyals et al., 2016), showing a significant improvement over existing methods. An ablation study further demonstrates that the effectiveness of ConstellationNet is not achieved by simply increasing the model complexity, e.g. by using more convolution channels or deeper and wider convolution layers (WRN-28-10 (Zagoruyko & Komodakis, 2016)) (see the ablation study in Table 3 and Figure 2(e)).

2. RELATED WORK

Few-Shot Learning. Recently, few-shot learning has attracted much attention in the deep learning community (Snell et al., 2017; Lee et al., 2019). Current few-shot learning is typically formulated as a meta-learning problem (Finn et al., 2017), in which an effective feature embedding is learned for generalization across novel tasks. We broadly divide the existing few-shot learning approaches into three categories: (1) Gradient-based methods optimize the feature embedding with gradient descent during the meta-test stage (Finn et al., 2017; Bertinetto et al., 2018; Lee et al., 2019). (2) Metric-based methods learn a fixed optimal embedding with a distance-based prediction rule (Vinyals et al., 2016; Snell et al., 2017). (3) Model-based methods obtain a conditional feature embedding via a weight predictor (Mishra et al., 2017; Munkhdalai et al., 2017). Here we adopt ProtoNet (Snell et al., 2017), a popular metric-based framework, in our approach and boost the generalization ability of the feature embeddings with explicit structured representations from the constellation model. Recently, Tokmakov et al. (2019) proposed a compositional regularization that relates an image to its attribute annotations, which is different from our unsupervised part-discovery strategy.

Part-Based Constellation/Discriminative Models. The constellation model family (Weber et al., 2000; Felzenszwalb & Huttenlocher, 2005; Fergus et al., 2003; Sudderth et al., 2005; Fei-Fei et al., 2006; Zhu & Mumford, 2007) is mostly generative/expressive and shares two commonalities in the representation: (1) clustering/codebook learning for the appearance and (2) modeling of the spatial configurations.
The key difference among these approaches lies in how the spatial configuration is modeled: Gaussian distributions (Weber et al., 2000); pictorial structures (Felzenszwalb & Huttenlocher, 2005); a joint shape model (Fergus et al., 2003); a hierarchical graphical model (Sudderth et al., 2005); grammar-based models (Zhu & Mumford, 2007). These constellation models represent a promising direction for object recognition but are not practically competitive with deep learning based approaches. There are also discriminative models: the discriminatively trained part-based model (DPM) (Felzenszwalb et al., 2009) is a typical method in this vein, where object parts (as HOG features (Dalal & Triggs, 2005)) and their configurations (a star model) are learned jointly in a discriminative way. The spatial pyramid matching method (SPM) (Lazebnik et al., 2006) has no explicit parts but instead builds on different levels of grids with a codebook learned on top of SIFT features (Lowe, 2004). DPM and SPM are of practical significance for object detection and recognition. In our approach, we implement the constellation model with cell feature clustering and attention-based cell relation modeling, capturing the appearance and the spatial configuration respectively.
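For concreteness, the distance-based prediction rule of ProtoNet (Snell et al., 2017), the metric-based framework adopted above, can be sketched as follows. The function name and the plain numpy embedding inputs are illustrative assumptions; in practice the embeddings come from the learned backbone.

```python
import numpy as np

def protonet_predict(support, support_labels, query, n_way):
    """Prototypical-network prediction rule: each class prototype is the
    mean of its support embeddings, and each query embedding is assigned
    to the nearest prototype in Euclidean distance.

    support: (N_s, D) support-set embeddings
    support_labels: (N_s,) integer labels in [0, n_way)
    query: (N_q, D) query embeddings
    Returns (N_q,) predicted class indices.
    """
    # Class prototypes: per-class mean of the support embeddings
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in range(n_way)])
    # Squared Euclidean distance from each query to each prototype
    d2 = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)
```

At training time the negative distances are passed through a softmax to obtain class probabilities; the argmin above is the resulting hard prediction at test time.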

