MULTILAYER DENSE CONNECTIONS FOR HIERARCHICAL CONCEPT PREDICTION

Abstract

Classification is a pivotal function for many computer vision tasks such as image recognition, object detection, and scene segmentation. Multinomial logistic regression with a single final layer of dense connections has become the ubiquitous technique for CNN-based classification. While these classifiers learn a mapping from the input to a set of output category classes, they do not typically yield a comprehensive description of the category. In particular, when a CNN-based image classifier correctly identifies the image of a Chimpanzee, its output does not clarify that a Chimpanzee is a member of the Primate, Mammal, and Chordate families and is a living thing. We propose a multilayer dense connectivity for a CNN to simultaneously predict the category and its conceptual superclasses in hierarchical order. We experimentally demonstrate that our proposed dense connections, in conjunction with popular convolutional feature layers, can learn to predict the conceptual classes with a minimal increase in network size while maintaining the categorical classification accuracy.

1. INTRODUCTION

Classification is a core concept for numerous computer vision tasks. Given the convolutional features, different architectures classify either the image itself (He et al., 2015; Szegedy et al., 2016), the regions/bounding boxes for object detection (He et al., 2017; Liu et al., 2015), or, at the granular level, the pixels for scene segmentation (Chen et al., 2018). Although early image recognition works employed multiple classification layers (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015), more recent models have all used a single layer of dense connections (He et al., 2016; Szegedy et al., 2016) or convolutions (Lin et al., 2017). The vision community has invented a multitude of techniques to enhance the capacity of the feature computation layers (Xie et al., 2017; Huang et al., 2017; Hu et al., 2018; Dai et al., 2017; Chollet, 2016; Tan & Le, 2019). But the classification layer has mostly retained the form of a multinomial/softmax logistic regression performing a mapping from a set of inputs (images) to a set of categories/labels. As such, the final output of these networks does not furnish a comprehensive depiction of the input entity. In particular, when an existing CNN correctly identifies an image of an English Setter, the output does not convey that it is an instance of a dog, or more extensively, a hunting dog, a domestic animal, and a living thing. It is rational to assume that the convolutional layers construct some internal representation of the conceptual superclasses, e.g., dog, animal, etc., during training. We argue that, by appropriately harnessing such a representation, one can retrieve a much broader description of the input image from a CNN than is supplied by a single-layer output. Extensive information about most categories is freely available in repositories such as WordNet (Fellbaum, 1998).
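As a minimal illustration of the standard setup described above, the following sketch shows a single dense layer with a softmax producing only a flat category distribution; the feature vector, weights, and dimensions are all hypothetical stand-ins for a real convolutional backbone:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical dimensions: a 512-d convolutional feature, 1000 categories.
features = rng.standard_normal(512)          # output of the conv backbone
W = rng.standard_normal((512, 1000)) * 0.01  # single dense classification layer
b = np.zeros(1000)

probs = softmax(features @ W + b)
category = int(np.argmax(probs))
# The network emits only this flat category index; no superclass
# information (e.g., "dog" or "living thing") is exposed.
```

The single weight matrix `W` is the entire classification head in this regime, which is precisely why its output carries no hierarchical structure.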
WordNet provides a hierarchical organization of category classes (e.g., English Setter) and their conceptual superclasses (e.g., Hunting dog, Domestic animal, Living thing). However, a surprisingly limited number of CNNs utilize this concept hierarchy. The primary goal of almost all existing studies is to improve category-wise classification performance by exploiting the conceptual relations, often via a separate tool. Deng et al. (2014) and Ding et al. (2015) apply graphical models to capture the interdependence among concept labels to improve category classification accuracy. Other works either do not clarify the semantic meaning of the ancestor concepts (Yan et al., 2015) or impose a level of complexity in the additional tool (RNN) that is perhaps unnecessary (Hu et al., 2016). We have not found an
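WordNet's hypernym structure can be read as a chain of superclasses per category. The following toy sketch, with a hand-coded fragment of the hierarchy used purely for illustration (a real pipeline would query WordNet itself, e.g., through NLTK's wordnet corpus reader), shows the kind of ancestor description discussed above:

```python
# Toy stand-in for a fragment of the WordNet hypernym hierarchy.
PARENT = {
    "english_setter": "setter",
    "setter": "sporting_dog",
    "sporting_dog": "hunting_dog",
    "hunting_dog": "dog",
    "dog": "domestic_animal",
    "domestic_animal": "animal",
    "animal": "living_thing",
}

def superclasses(label):
    """Return the chain of conceptual superclasses, most specific first."""
    chain = []
    while label in PARENT:
        label = PARENT[label]
        chain.append(label)
    return chain

print(superclasses("english_setter"))
# ['setter', 'sporting_dog', 'hunting_dog', 'dog',
#  'domestic_animal', 'animal', 'living_thing']
```

A hierarchical classifier would predict such a chain alongside the leaf category, rather than the leaf alone.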

