EMPOWERING NETWORKS WITH SCALE AND ROTATION EQUIVARIANCE USING A SIMILARITY CONVOLUTION

Abstract

The translation-equivariant nature of Convolutional Neural Networks (CNNs) is a key reason for their great success in computer vision. However, CNNs do not enjoy more general equivariance properties, such as equivariance to rotation or scaling, which ultimately limits their generalization performance. To address this limitation, we devise a method that endows CNNs with simultaneous equivariance to translation, rotation, and scaling. Our approach defines a convolution-like operation and ensures equivariance based on our proposed scalable Fourier-Argand representation. The method maintains efficiency similar to that of a traditional network and introduces hardly any additional learnable parameters, since it avoids the computational issues that often arise with group-convolution operators. We validate the efficacy of our approach on image classification, demonstrating its robustness and its ability to generalize to both scaled and rotated inputs.

1. INTRODUCTION

The remarkable success of network architectures can be largely attributed to the availability of large datasets and large numbers of parameters, which enable networks to "remember" vast amounts of information. In contrast, humans can learn new concepts from very little data and are able to generalize this knowledge. This disparity is due, in part, to current limitations in modeling geometric deformations in network architectures. Networks are inclined to "remember" data through filter parameters rather than "learn" a fully general representation. For instance, in classification tasks, networks trained on datasets with specific object sizes often fail when tested on object sizes that were not present in the training set. The ability to factor out transformations, such as rotation or scaling, in the learning process remains to be addressed. It is indeed quite common to encounter images in which objects have a different orientation and scale than in the training set, for instance, as a result of changes in the distance and orientation of the camera. To mitigate this issue, it is common practice to perform data augmentation (Krizhevsky et al., 2012) prior to training. However, this leads to a substantially larger dataset and makes training more complicated. Moreover, this strategy tends to produce groups of near-duplicate filters, which often requires more learnable parameters to achieve competitive performance. A visualization of first-layer weights (Zeiler & Fergus, 2014) highlights that many filters are rotated and scaled versions of a common prototype, which results in significant redundancy. The concept of equivariance emerged as a potential solution to this issue.
Simply put, equivariance requires that if a given input undergoes a specific geometric transformation, the resulting output feature of the network (even with randomly initialized weights) exhibits a correspondingly predictable geometric transformation. If a network is equivariant to scalings and rotations, training it at only one size and orientation naturally generalizes its performance to all sizes and orientations. To achieve this property, group-convolution methods have been widely used. An oversimplified interpretation of a typical group-convolution method is as follows: features are convolved with transformed copies of the same filter template to obtain multi-channel features, so that a transformation of the input corresponds to a cyclic shift across channels. For example, equivariant CNNs over a discrete set of orientations (Cohen & Welling, 2016; Zhou et al., 2017) leverage several directional filters to obtain equivariance within a discrete group. Further works extended rotation equivariance to continuous groups, using techniques such as steerable filters (Weiler et al., 2018; Cohen et al., 2019), B-spline interpolation (Bekkers, 2020), or Lie group theory (Bekkers, 2020; Finzi et al., 2020). A similar path to scaling equivariance has been explored, although scaling, unlike rotation, is not intrinsically periodic. The deep scale space (Worrall & Welling, 2019) defined a semi-symmetry group to approximately achieve scale equivariance, while Sosnovik et al. (2020) applied steerable CNNs to scaling. However, integrating equivariance to both rotations and scalings simultaneously leads to a larger group (e.g., a rotation group with M points and a scaling group with N points yield M × N points for the joint group), making the task more challenging. Additionally, certain "weight-sharing" techniques based on group convolution can be computationally and memory-intensive.
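To make the equivariance definition concrete, the following minimal numpy sketch (an illustration, not the paper's method) verifies the discrete analogue for plain 2-D cross-correlation: rotating the input and the filter by 90° rotates the output by 90°. The helper `correlate2d_same` is a hypothetical name for a zero-padded "same"-mode correlation.

```python
import numpy as np

def correlate2d_same(x, k):
    """Zero-padded 'same' cross-correlation of an image with an odd-sized kernel."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))   # square input
k = rng.normal(size=(3, 3))   # odd-sized filter

# Rotating input AND filter by 90 degrees rotates the output by 90 degrees:
lhs = correlate2d_same(np.rot90(x), np.rot90(k))
rhs = np.rot90(correlate2d_same(x, k))
assert np.allclose(lhs, rhs)
```

This identity only holds exactly for the four grid rotations; extending it to continuous rotations (and scalings) is precisely what the steerable and group-convolution approaches above address.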
Despite these difficulties, empowering a model with joint rotation and scaling equivariance can be highly advantageous. For instance, in object detection, changes in the distance between the camera and the object, or random rotations of the object, can significantly impact the accuracy of a method. The aim of this paper is to propose a CNN architecture that achieves continuous equivariance with respect to both rotation and scaling, thereby filling a void in the field of equivariant methods. To accomplish this, we provide a theoretical framework and analysis that guarantee the network preserves this property. Based on this, we propose a new architecture, the Scale and Rotation Equivariant Network (SREN), that avoids the aforementioned limitations and does not sizably increase computational complexity. Specifically, we first design a scalable Fourier-Argand representation, whose basis makes it possible to handle angle and scale in one shot. We then propose a new convolution-like operator that is functionally similar to traditional convolution but differs in key respects. We show that this new method has computational complexity similar to convolution and can easily replace typical network structures. Our approach models both rotation and scaling, enabling it to consistently achieve accurate results when tested on datasets that have undergone such transformations. The main contributions of this paper are summarized as follows:
• We introduce the scalable Fourier-Argand representation, which enables equivariance to rotation and scaling.
• We propose the SimConv operator, which, together with the scalable Fourier-Argand filter, forms the Scale and Rotation Equivariant Network (SREN) architecture.
• SREN is an equivariant network for rotation and scaling that is distinct from group-convolutional neural networks, offering the community a new possible path to solving this problem.

2. RELATED WORK

Group convolution A natural direction for achieving equivariance is the application of group theory. Cohen & Welling (2016) introduced group convolution, which enforces equivariance to a small, discrete group of transformations, i.e., rotations by multiples of 90 degrees. Subsequent efforts have aimed to generalize equivariance (Zhou et al., 2017) and to address continuous groups via steerable filters (Cohen & Welling, 2017). Lie group theory has also been utilized to this end, as in LieConv (Finzi et al., 2020), albeit only for compact groups. Unfortunately, the scaling group is non-compact, so methods typically treat it as a semi-group and use approximations to achieve truncated scaling equivariance. TridentNet (Li et al., 2019) achieves scale invariance by sharing weights among kernels with different dilation rates. Another approach is to apply scale-space theory (Lindeberg, 2013), which
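The dilation-based weight sharing used by TridentNet admits a small self-contained illustration (a sketch under our own assumptions, not TridentNet's actual code): the same 3×3 weights are applied at dilation rates 1 and 2, and for an input upsampled ×2 by nearest neighbour, the dilation-2 response read on the even grid reproduces the dilation-1 response on the original input. The helpers `correlate2d_same` and `dilate_kernel` are hypothetical names introduced here.

```python
import numpy as np

def correlate2d_same(x, k):
    """Zero-padded 'same' cross-correlation of an image with an odd-sized kernel."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def dilate_kernel(k, rate):
    """Insert (rate - 1) zeros between taps: the SAME weights, used at a larger scale."""
    kh, kw = k.shape
    kd = np.zeros((rate * (kh - 1) + 1, rate * (kw - 1) + 1))
    kd[::rate, ::rate] = k
    return kd

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 6))
k = rng.normal(size=(3, 3))

y1 = correlate2d_same(x, k)                     # dilation rate 1, original scale
x2 = np.kron(x, np.ones((2, 2)))                # nearest-neighbour 2x upsampling
y2 = correlate2d_same(x2, dilate_kernel(k, 2))  # same weights, dilation rate 2

# The dilation-2 response on the upsampled input, subsampled on the even grid,
# equals the dilation-1 response on the original input:
assert np.allclose(y2[::2, ::2], y1)
```

This exact relation is specific to integer scale factors on the pixel grid; handling continuous scalings is where the approximations mentioned above become necessary.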



Figure: Visualization of the Sim(2) equivariance property. Our SREN method inherently retains the structure information of the input, enabling it to handle transformed objects (rotation, scaling, and translation) without additional training.

