MULTILAYER DENSE CONNECTIONS FOR HIERARCHICAL CONCEPT PREDICTION

Abstract

Classification is a pivotal function for many computer vision tasks such as image recognition, object detection, and scene segmentation. Multinomial logistic regression with a single final layer of dense connections has become the ubiquitous technique for CNN-based classification. While these classifiers learn a mapping between the input and a set of output category classes, they do not typically yield a comprehensive description of the category. In particular, when a CNN-based image classifier correctly identifies the image of a Chimpanzee, its output does not clarify that a Chimpanzee is a member of the Primate, Mammal, and Chordate families, and a living thing. We propose a multilayer dense connectivity for a CNN to simultaneously predict the category and its conceptual superclasses in hierarchical order. We experimentally demonstrate that our proposed dense connections, in conjunction with popular convolutional feature layers, can learn to predict the conceptual classes with a minimal increase in network size while maintaining the categorical classification accuracy.

1. INTRODUCTION

Classification is a core concept for numerous computer vision tasks. Given the convolutional features, different architectures classify either the image itself (He et al., 2015; Szegedy et al., 2016), the regions/bounding boxes for object detection (He et al., 2017; Liu et al., 2015), or, at the granular level, the pixels for scene segmentation (Chen et al., 2018). Although early image recognition works employed multilayer classification layers (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015), more recent models have all used a single layer of dense connections (He et al., 2016; Szegedy et al., 2016) or convolutions (Lin et al., 2017). The vision community has invented a multitude of techniques to enhance the capacity of the feature computation layers (Xie et al., 2017; Huang et al., 2017; Hu et al., 2018; Dai et al., 2017; Chollet, 2016; Tan & Le, 2019). But the classification layer has mostly retained the form of a multinomial/softmax logistic regression performing a mapping from a set of inputs (images) to a set of categories/labels. As such, the final output of these networks does not furnish a comprehensive depiction of the input entity. In particular, when an existing CNN correctly identifies an image of an English Setter, the output does not lay out that it is an instance of a dog, or, more extensively, a hunting dog, a domestic animal and a living thing. It is rational to assume that convolutional layers construct some internal representation of the conceptual superclasses, e.g., dog, animal etc., during training. We argue that, by appropriately harnessing such a representation, one can retrieve a much broader description of the input image from a CNN than is supplied by a single-layer output. Extensive information about most categories is freely available in repositories such as WordNet (Fellbaum, 1998).
WordNet provides the hierarchical organization of category classes (e.g., English Setter) and their conceptual superclasses (e.g., Hunting dog, Domestic animal, Living thing). However, surprisingly few CNNs utilize the concept hierarchy. The primary goal of almost all existing studies is to improve the category-wise classification performance by exploiting the conceptual relations, often via a separate tool. Deng et al. (2014) and Ding et al. (2015) apply graphical models to capture the interdependence among concept labels to improve category classification accuracy. Other works either do not clarify the semantic meaning of the ancestor concepts (Yan et al., 2015) or impose a level of complexity in the additional tool (an RNN) that is perhaps unnecessary (Hu et al., 2016). We have not found an existing (deep learning) model that attempts to predict both the finer categories and the chain of ancestor concepts for an input image with a single network. The classical hedging method (Deng et al., 2012) computes either the finer label or one of its superclasses exclusively, but not both simultaneously.

Figure 1: The goal of the proposed algorithm. In contrast to existing methods, our proposed CNN architecture predicts the chain of superclass concepts as well as the finer category.

In this paper, we introduce a CNN that classifies the category and the concept superclasses simultaneously. As illustrated in Figure 1, in order to classify any category class (e.g., English Setter), our model is constrained to also predict the ancestor superclasses (e.g., Hunting dog, Domestic animal, Living thing) in the same order as defined in a given ontology. We propose a configuration of multilayer dense connections to predict the category and concept superclasses as well as to model their interrelations based on the ontology. We also propose a simple method to prune and rearrange the label hierarchy for efficient connectivity.
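The idea of multilayer dense connections conditioned on an ontology can be illustrated with a minimal numpy sketch. This is a toy rendering of the general mechanism, not the paper's exact configuration: every layer predicts one level of the concept chain (coarsest first), and each finer level receives the backbone features concatenated with the previous level's probabilities, so category predictions are conditioned on ancestor predictions. All sizes and weight initializations below are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hierarchical_head(features, weights):
    """Predict one label distribution per hierarchy level, coarsest first.

    `weights` is a list of (W, b) pairs, one per level; level l sees the
    backbone features concatenated with the probabilities of level l-1,
    so finer predictions are conditioned on the ancestor predictions.
    """
    probs = []
    x = features
    for W, b in weights:
        p = softmax(x @ W + b)
        probs.append(p)
        x = np.concatenate([features, p], axis=-1)
    return probs

# Toy ontology: 2 superclasses (e.g., living thing / artifact) -> 4 categories.
rng = np.random.default_rng(0)
feat_dim, n_super, n_cat = 8, 2, 4
weights = [
    (rng.standard_normal((feat_dim, n_super)), np.zeros(n_super)),
    (rng.standard_normal((feat_dim + n_super, n_cat)), np.zeros(n_cat)),
]
feats = rng.standard_normal((3, feat_dim))  # stand-in for CNN features
p_super, p_cat = hierarchical_head(feats, weights)
print(p_super.shape, p_cat.shape)  # (3, 2) (3, 4)
```

In a real network the (W, b) pairs would be trainable dense layers attached after the convolutional backbone, with one output head per level of the pruned label hierarchy.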
Capturing the hierarchical relationship within the CNN architecture itself enables us to train the model end-to-end (as opposed to attaching a separate tool) by applying existing optimization strategies for training deep networks (one could also envision learning the multilayer connectivity structure from data with architecture learning techniques (Zoph et al., 2017; Pham et al., 2018)). We experimentally demonstrate that one can train the proposed architecture using standard optimization protocols to predict the concept classes with two popular CNN backbones, ResNet and InceptionV4, while maintaining their category-wise accuracy. The proposed multilayer connection is shown to further refine the learned representations of these backbone CNNs to yield better concept and category classification than 1) multinomial logistic regression and 2) other existing works (that apply separate mechanisms) on standard datasets and challenging images. Predicting coarser superclasses in addition to finer-level categories improves the interpretability of the classifier's performance. Even if an eagle is misclassified as a parrot, the capability of inferring that it is a bird, and not an artifact (e.g., a drone), may be beneficial in some applications (e.g., surveillance). More importantly, an object detector can enhance its capability on unseen categories by adopting the proposed classification scheme (as demonstrated in Section 4.4). For example, a movie/TV violence recognition/detection tool can recognize a piece of equipment as belonging to the 'weapon' concept class even if that particular weapon category was not in the training set. In visual question answering (VQA), encoding concept classes would expand the scope of query terms by allowing broader descriptions ('how many vehicles are present' in addition to 'how many buses', 'how many trucks', etc.; see Cao et al. (2018); Wang et al. (2016)). In Appendix G, we point out how our architecture can be extended to object detectors to compute the concept classes.
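One plausible end-to-end training objective for such a network (an illustrative assumption, not necessarily the paper's exact loss) is the sum of cross-entropies across the hierarchy levels, so the network is penalized for mistakes at every depth of the concept chain rather than only at the leaf:

```python
import numpy as np

def multilevel_cross_entropy(level_probs, level_targets):
    """Sum of per-level cross-entropies over a batch.

    `level_probs` holds one (batch, classes) probability matrix per
    hierarchy level; `level_targets` holds the integer labels per level.
    """
    loss = 0.0
    for p, t in zip(level_probs, level_targets):
        loss += -np.log(p[np.arange(len(t)), t] + 1e-12).mean()
    return loss

# Two levels: superclass and category predictions for a batch of 3.
p_super = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
p_cat = np.array([[0.6, 0.2, 0.1, 0.1],
                  [0.1, 0.1, 0.7, 0.1],
                  [0.25, 0.25, 0.25, 0.25]])
y_super = np.array([0, 1, 0])
y_cat = np.array([0, 2, 3])
loss = multilevel_cross_entropy([p_super, p_cat], [y_super, y_cat])
```

Because every level contributes a standard softmax cross-entropy term, the combined objective remains differentiable and can be minimized with the same SGD-style protocols used for single-head classifiers.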
In addition, we allude to the potential applications of our model for capturing label structures different from a concept graph, e.g., spatial or compositional dependence.

2. RELEVANT WORKS

Use of hierarchical classifiers can be traced back to the early works of Torralba et al. (2004); Wu et al. (2004); Fergus et al. (2010), which shared features for improved classification. Some studies claimed that a hierarchical organization of categories resembles how the human cognitive system stores knowledge (Zhao et al., 2011), while others experimentally showed a correspondence between the structure of a semantic hierarchy and the visual confusion between categories (Deng et al., 2010). Bengio et al. (2010); Deng et al. (2011) learn a label tree for efficient inference with low theoretical complexity and also suggest that a label hierarchy is beneficial for datasets with tens of thousands of categories. Deng et al. (2012) aim to predict either a coarse-level concept or a fine-level category (but not both at the same time) given an initial classical classifier. Provided an initial classifier output, this method determines the category or the coarse concept node (exclusively) with maximum reward based on aggregated probabilities in a label hierarchy. The reported results suggest the prediction of superclasses comes at the expense of fine-level category classification failures. For CNN-based classification, Deng et al. (2014) modeled relationships such as subsumption, overlap and exclusion among the categories
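The hedging behavior described above can be sketched as follows. This is a simplified stand-in for the rule of Deng et al. (2012), not their exact accuracy-constrained formulation: leaf probabilities are aggregated up a toy hierarchy, and among the nodes that clear a confidence threshold, the one with the highest (here hypothetical) reward is returned, so an uncertain classifier backs off to a superclass instead of committing to a fine category.

```python
import numpy as np

# Toy hierarchy over 4 leaf categories (integers are leaf indices):
# root -> {living: [0, 1], artifact: [2, 3]}
HIERARCHY = {
    "root": ["living", "artifact"],
    "living": [0, 1],
    "artifact": [2, 3],
}

def node_prob(node, leaf_probs):
    """Aggregated probability of a node = sum of its leaf probabilities."""
    if isinstance(node, int):
        return leaf_probs[node]
    return sum(node_prob(c, leaf_probs) for c in HIERARCHY[node])

def hedged_prediction(leaf_probs, rewards, threshold=0.9):
    """Return the highest-reward node whose aggregated probability
    clears the confidence threshold (fine labels reward more)."""
    nodes = list(HIERARCHY) + [0, 1, 2, 3]
    feasible = [n for n in nodes if node_prob(n, leaf_probs) >= threshold]
    return max(feasible, key=lambda n: rewards[n])

# Deeper nodes carry higher reward; values here are hypothetical.
rewards = {"root": 0.0, "living": 1.0, "artifact": 1.0,
           0: 2.0, 1: 2.0, 2: 2.0, 3: 2.0}
# Confident leaf: predicts the fine category.
print(hedged_prediction(np.array([0.95, 0.03, 0.01, 0.01]), rewards))  # 0
# Ambiguous within 'living': backs off to the superclass.
print(hedged_prediction(np.array([0.5, 0.45, 0.03, 0.02]), rewards))   # living
```

Note that the output is always a single node, either a category or a superclass, which is exactly the exclusivity that our approach removes by predicting the full ancestor chain at once.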

