TRANS-CAPS: TRANSFORMER CAPSULE NETWORKS WITH SELF-ATTENTION ROUTING

Abstract

Capsule Networks (CapsNets) have been shown to be a promising alternative to Convolutional Neural Networks (CNNs) in many computer vision tasks, owing to their ability to encode object viewpoint variations. However, iterative routing mechanisms suffer from high computational complexity and numerical instability, which stem from the challenging nature of the part-object encoding process. This hinders CapsNets from being utilized effectively in large-scale image tasks. In this paper, we propose a novel non-iterative routing strategy named self-attention routing (SAR) that computes the agreement between capsules in a single forward pass. SAR accomplishes this by utilizing a learnable inducing mixture of Gaussians (IMoG) to reduce the cost of computing pairwise attention values from quadratic to linear time complexity. Our experiments show that our Transformer Capsule Network (Trans-Caps) is better suited to complex image tasks, including CIFAR-10/100, Tiny-ImageNet, and ImageNet, than other prominent CapsNet architectures. We also show that Trans-Caps yields a dramatic improvement over its competitors when presented with novel viewpoints on the SmallNORB dataset, outperforming EM-Caps by 5.77% and 3.25% on the novel-azimuth and novel-elevation experiments, respectively. These observations suggest that our routing mechanism captures complex part-whole relationships, allowing Trans-Caps to construct reliable geometrical representations of objects.

1. INTRODUCTION

Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in many different computer vision tasks (Krizhevsky et al., 2012; He et al., 2016). This is achieved through local connectivity and parameter sharing across spatial locations, so that useful local features learned in one receptive field can then be detected across the entire input feature space. While such a mechanism is sufficient to learn relationships between nearby pixels and to detect the existence of objects of interest, CNNs often fail to detect objects presented in radically new viewpoints, due to the complex effects of viewpoint changes on the pixel intensity values. This limitation forces us to train each CNN with a large number of data points, which is computationally expensive. Capsule Networks (CapsNets) were introduced to explicitly learn a viewpoint-invariant representation of the geometry of an object. In CapsNets, each group of neurons (called a "capsule") encodes and represents the visual features of a higher-level object in an instantiation parameter vector or matrix (which we refer to as the pose vector or matrix throughout this paper). The lower-level capsules (which we refer to as part capsules) estimate the poses of the object parts and hierarchically combine them to predict the pose of the whole object in the next layer. The object-part relationship is viewpoint-invariant, meaning that changes in the viewpoint change the poses of parts and objects in a coordinated way. Therefore, regardless of the viewpoint, we can infer the pose of the whole object from its parts using a set of trainable viewpoint-invariant transformation matrices. Capsule routing mechanisms can therefore learn the underlying spatial relationships between parts and objects. This improves the generalization capabilities of the network due to the underlying linear relationship between viewpoint changes and the pose matrices.
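As an illustrative sketch only (the exact parameterization varies across CapsNet variants), the vote computation described above can be expressed as each part capsule i predicting object j's pose by multiplying its own pose by a trainable transformation matrix W[i, j]; all shapes and names below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_parts, n_objects, d = 32, 10, 16

# Part-capsule pose vectors (would come from the previous layer in practice).
u = rng.normal(size=(n_parts, d))

# Trainable, viewpoint-invariant transformation matrices, one per (part, object) pair.
W = 0.1 * rng.normal(size=(n_parts, n_objects, d, d))

# Vote of part i for the pose of object j: u_hat[i, j] = W[i, j] @ u[i].
u_hat = np.einsum('ijab,ib->ija', W, u)
print(u_hat.shape)  # (32, 10, 16)
```

Because W is shared across viewpoints, a change of viewpoint transforms all votes coherently, which is what lets the routing stage detect agreement among parts of the same object.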
In order to route information between capsules, the part capsules vote for the poses of the higher-level capsules (which we refer to as object capsules). A routing-by-agreement mechanism is employed to aggregate the votes (traditionally accomplished using a recurrent clustering procedure), effectively computing the contribution of each part to the object pose. While various proposed iterative routing mechanisms (such as Dynamic (Sabour et al., 2017) and EM (Hinton et al., 2018) routing) have been shown to be effective in detecting viewpoint variations, their iterative nature increases computational cost. Prior research has additionally shown that these routing mechanisms may fail to properly construct a parse tree between each set of part and object capsules, partly due to the inability of the network to learn routing weights through back-propagation (Peer et al., 2018). This ultimately limits the performance of CapsNets in real-world image classification tasks. Additionally, the number of routing iterations is an additional data-dependent hyper-parameter that needs to be carefully selected; failing to optimize the number of routing operations can result in increased bias or variance in the model (Hinton et al., 2018). This issue is amplified when training networks with multiple capsule layers. In this paper, we introduce a novel routing algorithm called self-attention routing (SAR), inspired by the structural resemblance between CapsNets and Transformer networks (Vaswani et al., 2017). This mechanism eliminates the need for recursive computations by replacing unsupervised routing procedures with a self-attention module, making CapsNets effective in complex and large-scale image classification tasks. Our algorithm also reduces the risks of under- and over-fitting associated with selecting too few or too many routing iterations, respectively.
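To make the complexity claim concrete, the following minimal NumPy sketch shows how a small set of learnable inducing points can stand in for full pairwise self-attention; this is a generic inducing-point attention pattern assumed for illustration, not the paper's exact IMoG formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def induced_attention(parts, inducing, d):
    """parts: (n, d) capsule votes; inducing: (m, d) learnable points, m << n.

    Full self-attention among parts costs O(n^2 * d); attending through the
    m inducing points costs O(n * m * d), i.e. linear in n for fixed m.
    """
    # Step 1: inducing points attend to all parts -> compact summary (m, d).
    h = softmax(inducing @ parts.T / np.sqrt(d)) @ parts
    # Step 2: parts attend back to the summary -> routed outputs (n, d).
    return softmax(parts @ h.T / np.sqrt(d)) @ h

rng = np.random.default_rng(0)
n, m, d = 64, 4, 16
out = induced_attention(rng.normal(size=(n, d)), rng.normal(size=(m, d)), d)
print(out.shape)  # (64, 16)
```

The key point is that agreement is computed in one forward pass through two fixed-cost attention steps, with no data-dependent iteration count to tune.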
We compare our proposed routing algorithm to two of the most prominent iterative methods, namely dynamic and EM routing, as well as to the recently published non-iterative self-routing mechanism (Hahn et al., 2019). We evaluate performance on several image classification datasets including SVHN, CIFAR-10, CIFAR-100, Tiny-ImageNet, ImageNet, and SmallNORB. Our results show that our model outperforms the other baseline CapsNets, achieving better classification performance and convergence speed while requiring significantly fewer trainable parameters, fewer computations (in FLOPs), and less memory. Moreover, our experimental results on the SmallNORB dataset with novel viewpoints show that the proposed model is significantly more robust to changes in viewpoint and is able to retain its performance under severe viewpoint shifts. All source code will be made publicly available.

2.1. CAPSULE NETWORKS

CapsNets were originally introduced in Transforming Autoencoders by Hinton et al. (2011), who pose computer vision tasks as inverse graphics problems to deal with variations in an object's instantiation parameters. This architecture learns to reconstruct an affine-transformed version of the input image, thereby learning to represent each input as a combination of its parts and their respective characteristics. Sabour et al. (2017) introduced capsules with Dynamic Routing (DR-Caps), which allows the network to learn part-whole relationships through an iterative unsupervised clustering procedure. In DR-Caps, capsules output a pose vector whose length (norm or magnitude) implicitly represents the capsule activation. The vector norm should be able to scale depending on the pose values; representing existence with the vector norm can therefore potentially weaken the representational power of any given capsule layer. Hinton et al. (2018) proposed capsules with EM routing (EM-Caps), where capsule activations and pose matrices are segregated to fit the votes from part capsules through a mixture of Gaussians. While powerful, CapsNet routing procedures have several fundamental limitations: 1) Iterative routing operations are the bottleneck of CapsNets due to their computational complexity, which limits their widespread applicability to complex, large-scale datasets (Zhang et al., 2018; Li et al., 2018). 2) The number of routing iterations is a hyper-parameter that needs to be carefully tuned to prevent under- and over-fitting (Hinton et al., 2018). 3) Lin et al. (2018) showed that even after seven routing iterations, the entropy of the coupling coefficients remained large, indicating that part capsules pass information to all object capsules. 4) EM-Caps have difficulty converging and have been shown to be numerically unstable, which limits their applicability in complex tasks (Ahmed & Torresani, 2019; Gritzman, 2019).
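The iterative procedure whose cost and hyper-parameter sensitivity are criticized above can be sketched as follows; this is a minimal NumPy rendition of the dynamic routing loop of Sabour et al. (2017), with shapes and the iteration count chosen for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, axis=-1, eps=1e-8):
    # Keeps the vector's direction while scaling its norm into [0, 1),
    # so the norm can act as an (implicit) activation.
    sq = (s ** 2).sum(axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, n_iters=3):
    """u_hat: (n_parts, n_objects, d) votes from part capsules."""
    n_parts, n_objects, _ = u_hat.shape
    b = np.zeros((n_parts, n_objects))         # routing logits
    for _ in range(n_iters):                   # the iterative bottleneck
        c = softmax(b, axis=1)                 # coupling coefficients
        s = (c[..., None] * u_hat).sum(0)      # weighted vote sum per object
        v = squash(s)                          # object poses: (n_objects, d)
        b = b + (u_hat * v[None]).sum(-1)      # reward agreeing votes
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(32, 10, 16)))
print(v.shape)  # (10, 16)
```

Note that `n_iters` is exactly the data-dependent hyper-parameter discussed in limitation 2), and that the softmax over `b` is the coupling coefficient whose entropy Lin et al. (2018) found to remain high.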
Several studies have proposed non-iterative methods to replace the traditional iterative routing mechanisms in CapsNets. STAR-CAPS (Ahmed & Torresani, 2019) combines an attention gate with a straight-through estimator to make a binary decision to either connect or disconnect the route between each part and object capsule. Tsai et al. (2020) proposed an inverted dot-product attention routing mechanism (IDPA-Caps) which generates the routing coefficients between capsules; they unroll the iterative routing procedure and perform the iterations concurrently which helps improve

