ROBUST SELF-SUPERVISED LEARNING WITH LIE GROUPS

Anonymous

Abstract

Deep learning has led to remarkable advances in computer vision. Even so, today's best models are brittle when presented with variations that differ even slightly from those seen during training. Minor shifts in the pose, color, or illumination of an object can lead to catastrophic misclassifications. State-of-the-art models struggle to understand how a given set of variations can affect different objects. We propose a framework for instilling a notion of how objects vary in more realistic settings. Our approach applies the formalism of Lie groups to capture continuous transformations, improving models' robustness to distributional shifts. We apply our framework on top of state-of-the-art self-supervised learning (SSL) models, finding that explicitly modeling transformations with Lie groups leads to substantial performance gains of greater than 10% for MAE, both on known instances seen during training in typical poses but presented in new poses, and on unknown instances in any pose. We also apply our approach to ImageNet, finding that the Lie operator improves performance by almost 4%. These results demonstrate the promise of learning transformations to improve model robustness.[1]

1. INTRODUCTION

State-of-the-art models have proven adept at modeling a number of complex tasks, but they struggle when presented with inputs different from those seen during training. For example, while classification models are very good at recognizing buses in the upright position, they fail catastrophically when presented with an upside-down bus, since such images are generally not included in standard training sets (Alcorn et al., 2019). This can be problematic for deployed systems, as models are required to generalize to settings not seen during training ("out-of-distribution (OOD) generalization"). One potential explanation for this failure of OOD generalization is that models exploit any and all correlations between inputs and targets. Consequently, models rely on heuristics that, while effective during training, may fail to generalize, leading to a form of "supervision collapse" (Jo & Bengio, 2017; Ilyas et al., 2019; Doersch et al., 2020; Geirhos et al., 2020a). However, a number of models trained without supervision (self-supervised) have recently been proposed, many of which exhibit improved, but still limited, OOD robustness (Chen et al., 2020; Hendrycks et al., 2019; Geirhos et al., 2020b).

The most common approach to this problem is to reduce the distribution shift by augmenting training data. Augmentations are also key for a number of contrastive self-supervised approaches, such as SimCLR (Chen et al., 2020). While this approach can be effective, it has a number of disadvantages. First, for image data, augmentations are most often applied in pixel space, with exceptions, e.g., Verma et al. (2019). This makes it easy to, for example, rotate the entire image, but very difficult to rotate a single object within the image. Since many of the variations seen in real data cannot be approximated by pixel-level augmentations, this can be quite limiting in practice. Second, similar to adversarial training (Madry et al., 2017; Kurakin et al., 2016), while augmentation can improve performance on known objects, it often fails to generalize to novel objects (Alcorn et al., 2019). Third, augmenting to enable generalization for one form of variation can often harm performance on other forms of variation (Geirhos et al., 2018; Engstrom et al., 2019), and is not guaranteed to provide the expected invariance to variations (Bouchacourt et al., 2021b). Finally, enforcing invariance is not guaranteed to provide the correct robustness that generalizes to new instances (as discussed in Section 2).

Figure 1: Summary of approach and gains. We generate a novel dataset containing rendered images of objects in typical and atypical poses, with some instances only seen in typical, but not atypical, poses (left). Using these data, we augment SSL models such as MAE with a learned Lie operator which approximates the transformations in the latent space induced by changes in pose (middle). Using this operator, we improve performance by >10% for MAE for both known instances in new poses and unknown instances in both typical and atypical poses (right).

For these reasons, we choose to explicitly model the transformations of the data as transformations in the latent representation rather than trying to be invariant to them. To do so, we use the formalism of Lie groups. Informally, Lie groups are continuous groups described by a set of real parameters (Hall, 2003). While many continuous transformations form matrix Lie groups (e.g., rotations), they lack the typical structure of a vector space.
However, each Lie group has a corresponding vector space, its Lie algebra, that can be described using basis matrices, allowing the infinitely many elements of the group to be described by a finite number of basis matrices. Our goal is to learn such matrices to directly model the variations in the data.

To summarize, our approach structures the representation space to enable self-supervised models to generalize variation across objects. Since many naturally occurring transformations (e.g., pose, color, size, etc.) are continuous, we develop a theoretically-motivated operator, the Lie operator, that acts in representation space (see Fig. 1). Specifically, the Lie operator learns the continuous transformations observed in data as a vector space, using a set of basis matrices (a minimal code sketch of this idea follows the contribution list below). With this approach, we make the following contributions:

1. We generate a novel dataset containing 3D objects in many different poses, allowing us to explicitly evaluate the ability of models to generalize both to known objects in unknown poses and to unknown objects in both known and unknown poses (Section 3).

2. Using this dataset, we evaluate the generalization capabilities of a number of standard models, including ResNet-50, ViT, MLP-Mixer, SimCLR, CLIP, VICReg, and MAE, finding that all state-of-the-art models perform relatively poorly in this setting (Section 3.2).

3. We incorporate our proposed Lie operator in two recent SSL approaches, masked autoencoders (MAE, He et al., 2021) and Variance-Invariance-Covariance Regularization (VICReg, Bardes et al., 2021), to directly model transformations in data (Section 2), resulting in substantial OOD performance gains of greater than 10% for MAE and of up to 8% for VICReg (Section 4.1). We also incorporate our Lie model in SimCLR (Chen et al., 2020) (Appendix E).

4. We run systematic ablations of each term of our learning objective in the MAE Lie model, showing the relevance of every component for best performance (Section 4.3).
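To make the role of the Lie operator concrete, the listing below gives a minimal, hypothetical PyTorch sketch, not our exact implementation: a set of learned Lie algebra basis matrices is combined with per-example coordinates and exponentiated to transform a latent vector. The class name, dimensions, and the way the coordinates t are obtained are illustrative assumptions.

import torch
import torch.nn as nn

class LieOperator(nn.Module):
    """Applies a learned Lie group action in latent space: z -> exp(sum_k t_k L_k) z."""

    def __init__(self, latent_dim: int, n_generators: int):
        super().__init__()
        # Learned Lie algebra basis matrices L_1, ..., L_K, each of shape (d, d).
        self.basis = nn.Parameter(0.01 * torch.randn(n_generators, latent_dim, latent_dim))

    def forward(self, z: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # z: (batch, d) latent vectors; t: (batch, K) transformation coordinates.
        algebra_elem = torch.einsum("bk,kij->bij", t, self.basis)  # (batch, d, d)
        group_elem = torch.matrix_exp(algebra_elem)                 # matrix exponential per example
        return torch.einsum("bij,bj->bi", group_elem, z)

# Illustrative usage: given encoder latents z and z_target for two views of an object
# related by an unknown continuous transformation (e.g., a pose change), one could
# infer t with a small network and train the operator so that op(z, t) matches z_target.
op = LieOperator(latent_dim=128, n_generators=4)
z = torch.randn(8, 128)
t = torch.randn(8, 4)
z_moved = op(z, t)  # shape (8, 128)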



[1] Code to reproduce all experiments will be available upon acceptance.

