TRANS-CAPS: TRANSFORMER CAPSULE NETWORKS WITH SELF-ATTENTION ROUTING

Abstract

Capsule Networks (CapsNets) have been shown to be a promising alternative to Convolutional Neural Networks (CNNs) in many computer vision tasks, owing to their ability to encode object viewpoint variations. However, the iterative routing mechanisms used to encode part-object relationships suffer from high computational complexity and numerical instability, which hinders the effective use of CapsNets in large-scale image tasks. In this paper, we propose a novel non-iterative routing strategy named self-attention routing (SAR) that computes the agreement between capsules in a single forward pass. SAR accomplishes this by utilizing a learnable inducing mixture of Gaussians (IMoG) to reduce the cost of computing pairwise attention values from quadratic to linear time complexity. Our experiments show that our Transformer Capsule Network (Trans-Caps) is better suited to complex image tasks, including CIFAR-10/100, Tiny-ImageNet, and ImageNet, than other prominent CapsNet architectures. We also show that Trans-Caps yields a dramatic improvement over its competitors when presented with novel viewpoints on the SmallNORB dataset, outperforming EM-Caps by 5.77% and 3.25% on the novel-azimuth and novel-elevation experiments, respectively. These observations suggest that our routing mechanism captures complex part-whole relationships, allowing Trans-Caps to construct reliable geometrical representations of objects.

1. INTRODUCTION

Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in many different computer vision tasks (Krizhevsky et al., 2012; He et al., 2016). This is achieved through local connectivity and parameter sharing across spatial locations, so that useful local features learned in one receptive field can be detected anywhere in the input feature space. While such a mechanism is sufficient to learn relationships between nearby pixels and to detect the presence of objects of interest, CNNs often fail to detect objects presented from radically new viewpoints, due to the complex effects of viewpoint changes on pixel intensity values. This limitation forces us to train each CNN on a large number of data points, which is computationally expensive.

Capsule Networks (CapsNets) were introduced to explicitly learn a viewpoint-invariant representation of the geometry of an object. In CapsNets, each group of neurons (called a "capsule") encodes the visual features of a higher-level object in an instantiation parameter vector or matrix (which we refer to as the pose vector or matrix throughout this paper). The lower-level capsules (which we refer to as part capsules) estimate the poses of the object parts and hierarchically combine them to predict the pose of the whole object in the next layer. The object-part relationship is viewpoint-invariant, meaning that changes in the viewpoint alter the poses of parts and objects in a coordinated way. Therefore, regardless of the viewpoint, we can infer the pose of the whole object from its parts using a set of trainable viewpoint-invariant transformation matrices. Capsule routing mechanisms can therefore learn the underlying spatial relationships between parts and objects, which improves the generalization capabilities of the network due to the linear relationship between viewpoint changes and pose matrices.
In order to route information between capsules, the part capsules vote for the poses of the higher-level capsules (which we refer to as object capsules). A routing-by-agreement mechanism is then employed to aggregate these votes (traditionally via an iterative clustering procedure), effectively computing the contribution of each part to the object's pose.
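To make the voting and agreement steps above concrete, the following is a minimal NumPy sketch of the classical iterative routing-by-agreement baseline (in the style of Sabour et al., 2017) that SAR is designed to replace; it is an illustration of the generic mechanism, not of the paper's self-attention routing. The tensor shapes and function names (`capsule_votes`, `agreement_routing`) are our own assumptions for exposition.

```python
import numpy as np

def capsule_votes(part_poses, W):
    """Each part capsule i votes for object capsule j: v_ij = W_ij @ u_i.

    part_poses: (n_parts, d_in) pose vectors of the part capsules.
    W:          (n_parts, n_objects, d_out, d_in) trainable,
                viewpoint-invariant transformation matrices.
    Returns votes of shape (n_parts, n_objects, d_out).
    """
    return np.einsum('ijkl,il->ijk', W, part_poses)

def agreement_routing(votes, n_iters=3):
    """Iterative routing-by-agreement: a recurrent clustering of the votes."""
    n_parts, n_objects, _ = votes.shape
    logits = np.zeros((n_parts, n_objects))          # routing logits b_ij
    for _ in range(n_iters):
        # Coupling coefficients: softmax over object capsules.
        c = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        # Weighted sum of the votes for each object capsule.
        s = (c[..., None] * votes).sum(axis=0)       # (n_objects, d_out)
        # Squash nonlinearity keeps each object pose norm below 1.
        norm = np.linalg.norm(s, axis=-1, keepdims=True)
        obj = (norm**2 / (1.0 + norm**2)) * s / (norm + 1e-8)
        # Increase logits where a part's vote agrees with the object pose.
        logits = logits + np.einsum('ijk,jk->ij', votes, obj)
    return obj
```

The quadratic cost the paper targets is visible here: every part-object pair carries its own vote and coupling coefficient, and the clustering loop must be unrolled at every forward pass.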

