EMPOWERING NETWORKS WITH SCALE AND ROTATION EQUIVARIANCE USING A SIMILARITY CONVOLUTION

Abstract

The translational equivariance of Convolutional Neural Networks (CNNs) is a key reason for their great success in computer vision. However, these networks do not enjoy more general equivariance properties, such as equivariance to rotation or scaling, which ultimately limits their generalization performance. To address this limitation, we devise a method that endows CNNs with simultaneous equivariance with respect to translation, rotation, and scaling. Our approach defines a convolution-like operation and ensures equivariance based on our proposed scalable Fourier-Argand representation. The method maintains efficiency similar to that of a traditional network and introduces hardly any additional learnable parameters, since it avoids the computational burden that often arises in group-convolution operators. We validate the efficacy of our approach on image classification, demonstrating its robustness and its ability to generalize to both scaled and rotated inputs.

1. INTRODUCTION

The remarkable success of network architectures can be largely attributed to the availability of large datasets and large numbers of parameters, which enable them to "remember" vast amounts of information. In contrast, humans can learn new concepts from very little data and are able to generalize this knowledge. This disparity is due, in part, to the current limitations in modeling geometric deformations within network architectures. Networks are inclined to "remember" data through filter parameters rather than "learn" a fully general representation. For instance, in classification tasks, networks trained on datasets with specific object sizes often fail when tested on object sizes absent from the training set. The ability to factor out transformations such as rotation and scaling from the learning process remains an open problem. It is indeed quite common to encounter images in which objects have a different orientation or scale than in the training set, for instance as a result of changes in the distance and orientation of the camera. To mitigate this issue, it is common practice to perform data augmentation (Krizhevsky et al., 2012) prior to training. However, this leads to a substantially larger dataset and complicates training. Moreover, this strategy tends to learn groups of near-duplicate filters, which often requires more learnable parameters to achieve competitive performance. A visualization of the first-layer weights (Zeiler & Fergus, 2014) shows that many filters are rotated and scaled versions of a common prototype, which amounts to significant redundancy. The concept of equivariance has emerged as a potential solution to this issue.
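As a concrete illustration (our own sketch, not part of the proposed method), the following NumPy snippet checks numerically that an ordinary convolution commutes with translations but not with rotations. Circular boundary conditions are used so that translation equivariance holds exactly; the function and variable names are illustrative only.

```python
import numpy as np

def circ_corr2d(img, k):
    """Circular 2-D cross-correlation (periodic boundaries):
    out[i, j] = sum_{u, v} k[u, v] * img[(i+u) % H, (j+v) % W]."""
    out = np.zeros_like(img, dtype=float)
    for u in range(k.shape[0]):
        for v in range(k.shape[1]):
            out += k[u, v] * np.roll(img, shift=(-u, -v), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy input "image"
k = rng.standard_normal((3, 3))   # generic, non-symmetric filter

# Translation equivariance: shifting the input, then convolving,
# equals convolving first and shifting the output.
shift_then_conv = circ_corr2d(np.roll(x, (2, 3), axis=(0, 1)), k)
conv_then_shift = np.roll(circ_corr2d(x, k), (2, 3), axis=(0, 1))
assert np.allclose(shift_then_conv, conv_then_shift)

# Rotation equivariance fails for a generic kernel: rotating the
# input does not simply rotate the output.
rot_then_conv = circ_corr2d(np.rot90(x), k)
conv_then_rot = np.rot90(circ_corr2d(x, k))
assert not np.allclose(rot_then_conv, conv_then_rot)
```

The failing rotation check is precisely the gap that rotation- and scale-equivariant designs, such as the one proposed here, aim to close.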
Simply put, equivariance requires that if a given input undergoes a specific geometric transformation, the resulting output feature of the network (even with randomly initialized weights) should exhibit a correspondingly predictable geometric transformation. If a network is equivariant to scalings and rotations, training it on a single size and orientation naturally generalizes its performance to all sizes and orientations. Group convolution methods have been widely used to achieve this property. An oversimplified interpretation of a typical group convolution method is as follows: features are convolved with transformed (e.g., dilated) copies of the same filter template to obtain multi-channel features, so that a transformation of the input corresponds to a cyclic shift across channels. For example, equivariant

