THE LIE DERIVATIVE FOR MEASURING LEARNED EQUIVARIANCE

Abstract

Equivariance guarantees that a model's predictions capture key symmetries in data. When an image is translated or rotated, an equivariant model's representation of that image will translate or rotate accordingly. The success of convolutional neural networks has historically been tied to translation equivariance directly encoded in their architecture. The rising success of vision transformers, which have no explicit architectural bias towards equivariance, challenges this narrative and suggests that augmentations and training data might also play a significant role in their performance. In order to better understand the role of equivariance in recent vision models, we apply the Lie derivative, a method for measuring equivariance with strong mathematical foundations and minimal hyperparameters. Using the Lie derivative, we study the equivariance properties of hundreds of pretrained models, spanning CNNs, transformers, and Mixer architectures. The scale of our analysis allows us to separate the impact of architecture from other factors like model size or training method. Surprisingly, we find that many violations of equivariance can be linked to spatial aliasing in ubiquitous network layers, such as pointwise non-linearities, and that as models get larger and more accurate they tend to display more equivariance, regardless of architecture. For example, transformers can be more equivariant than convolutional neural networks after training.

Figure 1: (Left): The Lie derivative measures the equivariance of a function under continuous transformations, here rotation. (Center): Using the Lie derivative, we quantify how much each layer contributes to the equivariance error of a model. Our analysis highlights surprisingly large contributions from non-linearities, which affect both CNN and ViT architectures. (Right): Translation equivariance as measured by the Lie derivative correlates with generalization in classification models, across convolutional and non-convolutional architectures. Although CNNs are often noted for their intrinsic translation equivariance, ViT and Mixer models are often more translation equivariant than CNN models after training.

1. INTRODUCTION

Symmetries allow machine learning models to generalize properties of one data point to an entire class of data points. A model that captures translational symmetry, for example, will have the same output for an image and a version of that image shifted half a pixel to the left or right. If a classification model produces dramatically different predictions under translation by half a pixel or rotation by a few degrees, it is likely misaligned with physical reality.

Equivariance provides a formal notion of consistency under transformation. A function is equivariant if symmetries in the input space are preserved in the output space. Baking equivariance into models through architecture design has led to breakthrough performance across many data modalities, including images (Cohen & Welling, 2016; Veeling et al., 2018), proteins (Jumper et al., 2021), and atomic force fields (Batzner et al., 2022; Frey et al., 2022). In computer vision, translation equivariance has historically been regarded as a particularly compelling property of convolutional neural networks (CNNs) (LeCun et al., 1995). Imposing equivariance restricts the size of the hypothesis space, reducing the complexity of the learning problem and improving generalization (Goodfellow et al., 2016). In most neural network classifiers, however, true equivariance has been challenging to achieve, and many works have shown that model outputs can change dramatically for small changes in the input space (Azulay & Weiss, 2018; Engstrom et al., 2018; Vasconcelos et al., 2021; Ribeiro & Schön, 2021). Several authors have significantly improved the equivariance properties of CNNs with architectural changes inspired by careful signal processing (Zhang, 2019; Karras et al., 2021), but non-architectural mechanisms for encouraging equivariance, such as data augmentations, continue to be necessary for good generalization performance (Wightman et al., 2021).

* Equal contribution.
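One common source of the equivariance violations cited above is downsampling. The following toy NumPy sketch (our own illustration, not code from any of the cited works) shows that a stride-2 subsampling step, the core of strided convolution and pooling, does not commute with even a one-pixel shift of its input:

```python
import numpy as np

def downsample(x, stride=2):
    # Strided subsampling, as in pooling or strided convolution.
    return x[::stride]

def shift(x, s=1):
    # Circular shift by s samples.
    return np.roll(x, s)

x = np.arange(8.0)  # a simple 1-D "image"

# Equivariance would require downsample(shift(x)) to equal some fixed
# shift of downsample(x). For a stride-2 layer, a 1-pixel input shift
# has no matching output shift, and the two orders of operation disagree:
a = downsample(shift(x, 1))
b = np.roll(downsample(x), 1)  # best-case candidate output shift
print(a)                 # [7. 1. 3. 5.]
print(b)                 # [6. 0. 2. 4.]
print(np.allclose(a, b)) # False
```

The shifted-then-downsampled signal retains the odd-indexed samples rather than the even-indexed ones, so no output-space translation can reproduce it; this is the aliasing mechanism that careful anti-aliasing designs (e.g. Zhang, 2019) aim to mitigate.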
The increased prominence of non-convolutional architectures, such as vision transformers (ViTs) and mixer models, simultaneously demonstrates that explicitly encoding architectural biases for equivariance is not necessary for good generalization in image classification, as ViT models perform on-par with or better than their convolutional counterparts with sufficient data and well-chosen augmentations (Dosovitskiy et al., 2020; Tolstikhin et al., 2021) . Given the success of large flexible architectures and data augmentations, it is unclear what clear practical advantages are provided by explicit architectural constraints over learning equivariances from the data and augmentations. Resolving these questions systemically requires a unified equivariance metric and large-scale evaluation. In what follows, we introduce the Lie derivative as a tool for measuring the equivariance of neural networks under continuous transformations. The local equivariance error (LEE), constructed with the Lie derivative, makes it possible to compare equivariance across models and to analyze the contribution of each layer of a model to its overall equivariance. Using LEE, we conduct a large-scale analysis of hundreds of image classification models. The breadth of this study allows us to uncover a novel connection between equivariance and model generalization, and the surprising result that ViTs are often more equivariant than their convolutional counterparts after training. To explain this result, we use the layer-wise decomposition of LEE to demonstrate how common building block layers shared across ViTs and CNNs, such as pointwise non-linearities, frequently give rise to aliasing and violations of equivariance. We make our code publicly available at https://github.com/ngruver/lie-deriv.
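The idea behind a Lie-derivative-based equivariance measure can be sketched with finite differences: transform the input by a small continuous amount t, apply the function, undo the transformation at the output, and differentiate at t = 0; a perfectly equivariant function yields zero. The sketch below is our own minimal NumPy illustration of this construction (the paper's exact definition and implementation may differ), using Fourier phase shifts so that translations can be fractional:

```python
import numpy as np

def translate(x, t):
    # Continuous translation of a periodic 1-D signal by t samples,
    # implemented as a Fourier phase shift (t need not be an integer).
    k = np.fft.fftfreq(len(x))
    return np.real(np.fft.ifft(np.fft.fft(x) * np.exp(-2j * np.pi * k * t)))

def lie_derivative(f, x, eps=1e-3):
    # Central finite-difference approximation of
    # d/dt [ translate(f(translate(x, t)), -t) ] at t = 0.
    plus = translate(f(translate(x, eps)), -eps)
    minus = translate(f(translate(x, -eps)), eps)
    return (plus - minus) / (2 * eps)

rng = np.random.default_rng(0)
x = rng.standard_normal(32)

identity = lambda z: z             # perfectly translation equivariant
relu = lambda z: np.maximum(z, 0)  # pointwise ReLU introduces aliasing

print(np.linalg.norm(lie_derivative(identity, x)))  # ~0
print(np.linalg.norm(lie_derivative(relu, x)))      # clearly nonzero
```

The identity map commutes exactly with translation, so its Lie derivative vanishes, while the pointwise ReLU creates high-frequency content that aliases under continuous shifts, producing a nonzero equivariance error even though ReLU commutes with integer-pixel shifts.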

2. BACKGROUND

Groups and equivariance Equivariance provides a formal notion of consistency under transformation. A function f : V_1 → V_2 is equivariant to transformations from a symmetry group G if applying the symmetry to the input of f is the same as applying it to the output:

∀g ∈ G : f(ρ_1(g)x) = ρ_2(g)f(x),

where ρ_i(g) is the representation of the group element g, a linear map V_i → V_i. The most common example of equivariance in deep learning is the translation equivariance of convolutional layers: if we translate the input image by an integer number of pixels in x and y, the output is also translated by the same amount, ignoring the regions close to the boundary of the image. Here x ∈ V_1 = V_2 is an image and the representation ρ_1 = ρ_2 expresses translations of the image. The translation invariance of certain neural networks is also an expression of the equivariance property, but where the output vector space V_2 has the trivial representation ρ_2(g) = I, such that model outputs are unaffected by translations of the inputs. Equivariance is therefore a much richer framework, in which we can reason about representations at the input and the output of a function.
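Both cases of the definition above can be checked numerically. The following NumPy sketch (our own illustration, using a hand-rolled circular convolution rather than an actual network layer) verifies equivariance of a convolution under integer shifts, where ρ_1 = ρ_2 is the shift operator, and invariance of a global average pool, where ρ_2(g) = I:

```python
import numpy as np

def conv1d_circular(x, w):
    # Circular 1-D convolution: a translation-equivariant linear map.
    n = len(x)
    return np.array([sum(w[j] * x[(i - j) % n] for j in range(len(w)))
                     for i in range(n)])

x = np.arange(6.0)
w = np.array([1.0, -2.0, 0.5])

# Equivariance: f(rho_1(g) x) == rho_2(g) f(x), with rho_1 = rho_2
# both given by an integer circular shift.
shifted_then_conv = conv1d_circular(np.roll(x, 2), w)
conv_then_shifted = np.roll(conv1d_circular(x, w), 2)
print(np.allclose(shifted_then_conv, conv_then_shifted))  # True

# Invariance is the special case rho_2(g) = I: the global average of the
# convolution output is unchanged by shifting the input.
print(np.isclose(conv1d_circular(np.roll(x, 2), w).mean(),
                 conv1d_circular(x, w).mean()))  # True
```

Circular boundary conditions are used deliberately: with zero padding, the "regions close to the boundary" mentioned above would break exact equality, which is why the definition in the text excludes them.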

