DEFORMABLE CAPSULES FOR OBJECT DETECTION

Abstract

Capsule networks promise significant benefits over convolutional networks by storing stronger internal representations and routing information based on the agreement between the projections of intermediate representations. Despite this, their success has been mostly limited to small-scale classification datasets due to their computationally expensive nature. Recent studies have partially overcome this burden by locally constraining the dynamic routing of features with convolutional capsules. Though memory efficient, convolutional capsules impose geometric constraints which fundamentally limit the ability of capsules to model the pose/deformation of objects. Further, they do not address the bigger memory concern of class-capsules scaling up to bigger tasks such as detection or large-scale classification. In this study, we introduce deformable capsules (DeformCaps), a new capsule structure (SplitCaps), and a novel dynamic routing algorithm (SE-Routing) to balance computational efficiency with the need for modeling a large number of objects and classes. We demonstrate that the proposed methods allow capsules to efficiently scale up to large-scale computer vision tasks for the first time, and create the first-ever capsule network for object detection in the literature. Our proposed architecture is a one-stage detection framework and obtains results on MS COCO which are on par with state-of-the-art one-stage CNN-based methods, while producing fewer false positive detections.

1. INTRODUCTION

Capsule networks promise many potential benefits over convolutional neural networks (CNNs). These include practical benefits, such as requiring less data for training or better handling unbalanced class distributions (Jiménez-Sánchez et al., 2018), and important theoretical benefits, such as building in stronger internal representations of objects (Punjabi et al., 2020) and modeling the agreement between those intermediate representations which combine to form final object representations (e.g. part-whole relationships) (Kosiorek et al., 2019; Sabour et al., 2017). Although these benefits might not be seen in the performance metrics (e.g. average precision) on standard benchmark computer vision datasets, they are important for real-world applications. As an example, Alcorn et al. (2019) found that CNNs fail to recognize objects across 97% of their pose space, while capsule networks have been shown to be far more robust to pose variations of objects (Hinton et al., 2018); further, real-world datasets are often not as extensive and cleanly distributed as ImageNet or MS COCO. These benefits are achieved in capsule networks by storing richer vector (or matrix) representations of features, rather than the simple scalars of CNNs, and dynamically choosing how to route that information through the network. The instantiation parameters for a feature are stored in these capsule vectors and contain information (e.g. pose, deformation, hue, texture) useful for constructing the object being modeled. Early studies have shown strong evidence that these vectors do in fact capture important local and global variations across objects' feature components (or parts) within a class (Punjabi et al., 2020; Sabour et al., 2017). Inside their networks, capsules dynamically route their information, seeking to maximize the agreement between these vector feature representations and the higher-level feature vectors they are attempting to form.
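To make the idea of routing by agreement concrete, the following is a minimal sketch in the style of the dynamic routing of Sabour et al. (2017), cited above. It is not the SE-Routing algorithm introduced in this paper. It assumes the prediction vectors (here called `u_hat`, a hypothetical name) have already been computed, omits the learned transformation matrices, and uses only the Python standard library for self-containment; all function names are illustrative.

```python
import math

def squash(s):
    """Shrink vector s to a length in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_by_agreement(u_hat, num_iters=3):
    """u_hat[i][j] is child capsule i's predicted vector for parent j.

    Returns (v, c): parent output vectors v[j] and the coupling
    coefficients c[i][j] used in the final iteration.
    """
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")
    n_child, n_parent = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    b = [[0.0] * n_parent for _ in range(n_child)]  # routing logits
    for _ in range(num_iters):
        # Each child distributes its output over the parents (rows sum to 1).
        c = [softmax(row) for row in b]
        # Parent j is the squashed, coupling-weighted sum of predictions.
        v = []
        for j in range(n_parent):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_child))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement (dot product) between prediction and parent output
        # raises the corresponding logit for the next iteration.
        for i in range(n_child):
            for j in range(n_parent):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c

# Toy input (assumed values): two children agree on parent 0
# but make opposite predictions for parent 1.
u_hat = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, -1.0]],
]
v, c = route_by_agreement(u_hat)
```

In this toy setting, the coupling coefficients shift toward parent 0, whose predictions agree, while the conflicting predictions for parent 1 cancel out. The length of each parent vector `v[j]` can then be read as the confidence that the corresponding higher-level feature is present.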
Despite their potential benefits, many have remained unconvinced about the general applicability of capsule networks to large-scale computer vision tasks. To date, no capsule-based study has achieved classification performance comparable to a CNN on datasets such as ImageNet, with such studies instead relegated to smaller datasets such as MNIST or CIFAR. Worse still, to the best of our knowledge, no capsule network has shown successful results in object detection, a very important problem in computer

