TIME TO AUGMENT SELF-SUPERVISED VISUAL REPRESENTATION LEARNING

Abstract

Biological vision systems are unparalleled in their ability to learn visual representations without supervision. In machine learning, self-supervised learning (SSL) has led to major advances in forming object representations in an unsupervised fashion. Such systems learn representations invariant to augmentation operations over images, like cropping or flipping. In contrast, biological vision systems exploit the temporal structure of the visual experience during natural interactions with objects. This gives access to "augmentations" not commonly used in SSL, like watching the same object from multiple viewpoints or against different backgrounds. Here, we systematically investigate and compare the potential benefits of such time-based augmentations during natural interactions for learning object categories. Our results show that incorporating time-based augmentations achieves large performance gains over state-of-the-art image augmentations. Specifically, our analyses reveal that: 1) 3-D object manipulations drastically improve the learning of object categories; 2) viewing objects against changing backgrounds is important for learning to discard background-related information from the latent representation. Overall, we conclude that time-based augmentations during natural interactions with objects can substantially improve self-supervised learning, narrowing the gap between artificial and biological vision systems.

1. INTRODUCTION

Learning object representations without supervision is a grand challenge for artificial vision systems. Recent approaches for visual self-supervised learning (SSL) acquire representations invariant to data-augmentations based on simple image manipulations like crop/resize, blur or color distortion (Grill et al., 2020; Chen et al., 2020). The nature of these augmentations determines what information is retained and what information is discarded, and therefore how useful these augmentations are for particular downstream tasks (Jaiswal et al., 2021; Tsai et al., 2020). Biological vision systems, in contrast, appear to exploit the temporal structure of visual input during interactions with objects for unsupervised representation learning. According to the slowness principle (Wiskott and Sejnowski, 2002; Li and DiCarlo, 2010; Wood and Wood, 2018), biological vision systems strive to discard high-frequency variations (e.g. individual pixel intensities) and retain slowly varying information (e.g. object identity) in their representation. Incorporating this idea into SSL approaches has led to recent time-contrastive learning methods that learn to map inputs occurring close in time onto close-by latent representations (Oord et al., 2018; Schneider et al., 2021). How these systems generalize depends on the temporal structure of their visual input. In particular, visual input arising from embodied interactions with objects may lead to quite different generalizations compared to what is possible with simple image manipulations. For instance, human infants learning about objects interact with them in various ways (Smith et al., 2018). First, infants rotate objects and bring them closer or farther while playing with them (Byrge et al., 2014). Second, as they gain mobility, they can move around the environment while holding an object, viewing it in different contexts and against different backgrounds. We refer to (simulations of) such interactions as natural interactions.
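The time-contrastive objective mentioned above can be made concrete with a short sketch. The following is a minimal, illustrative InfoNCE-style loss over temporally adjacent frame embeddings; the batch layout, temperature, and the assumption that embeddings are already computed by some encoder are our own simplifications, not the exact formulation of the cited works:

```python
import numpy as np

def info_nce_time(z_t, z_next, temperature=0.1):
    """InfoNCE loss treating temporally adjacent frames as positive pairs.

    z_t, z_next: (batch, dim) embeddings of frames at times t and t+1.
    Row i of z_t is pulled toward row i of z_next (its temporal positive)
    and pushed away from all other rows of z_next (the negatives).
    """
    # L2-normalise so the dot product is a cosine similarity
    z_t = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    z_next = z_next / np.linalg.norm(z_next, axis=1, keepdims=True)
    logits = z_t @ z_next.T / temperature          # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the positive for row i sits on the diagonal
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss makes representations vary slowly in time: frames close in time end up close in latent space, which is the slowness principle expressed as a contrastive objective.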
Here, we systematically study the impact of such natural interactions on representations learnt through time-contrastive learning in different settings. We introduce two new simulation environments based on the near-photorealistic simulation platform ThreeDWorld (TDW) (Gan et al., 2021) and combine them with a recent dataset of thousands of 3D object models (Toys4k) (Stojanov et al., 2021). We then validate our findings on two video datasets of real human object manipulations, ToyBox (Wang et al., 2018) and CORe50 (Lomonaco and Maltoni, 2017). Our experiments show that adding time-based augmentations to conventional data-augmentations considerably improves category recognition. Furthermore, we show that the benefit of time-based augmentations during natural interactions has two main origins. First, 3-D object rotations boost generalization across object shapes. Second, viewing objects against different backgrounds while moving with them reduces the harmful effects of background clutter. We conclude that exploiting natural interactions via time-contrastive learning greatly improves self-supervised visual representation learning.

2. RELATED WORK

Data-augmented self-supervised learning. The general idea behind most recent approaches for self-supervised learning is that two semantically close/different inputs should be mapped to close/distant points in the learnt representation space. Applying a transformation to an image, such as flipping it horizontally, generates an image that is very different at the pixel level but has a similar semantic meaning. A learning objective for SSL therefore encourages the representations of an image and its augmented version to be close in latent space, while being far from the representations of other, unrelated images. A concrete approach may work as follows: sample an image x and apply transformations to it, taken from a predefined set, to obtain a new image x′; the two views (x, x′) are called a positive pair. The same procedure is applied to a batch of different images. Embeddings of positive pairs are brought together, while the embeddings within the batch are otherwise kept distant from one another. There are three main categories of approaches for doing so: contrastive learning methods (Chen et al., 2020; He et al., 2020) explicitly push the embeddings of a batch of inputs away from one another; distillation-based methods (Grill et al., 2020; Chen and He, 2021) use an asymmetric embedding architecture, allowing the model to discard the "push away" part; entropy maximization methods (Bardes et al., 2022; Ermolov and Sebe, 2020) maintain a high entropy in the embedding space.

Image manipulations as data-augmentations. Most self-supervised learning approaches have used augmentations based on simple image manipulations to learn representations. Frequently used are color distortion, cropping/resizing a part of an image, horizontal flipping, grayscaling, and blurring (Chen et al., 2020).
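The positive-pair construction described above can be sketched as follows. This is a toy stand-in for a SimCLR-style augmentation pipeline using only crop/resize and horizontal flip (real pipelines also add color distortion and blur, and the crop bounds here are arbitrary illustrative choices):

```python
import numpy as np

def augment(img, rng):
    """One random view of an image: random crop/resize + horizontal flip."""
    h, w, _ = img.shape
    # random crop covering at least half of each dimension
    ch = rng.integers(h // 2, h + 1)
    cw = rng.integers(w // 2, w + 1)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = img[top:top + ch, left:left + cw]
    # nearest-neighbour resize back to the original resolution
    rows = np.arange(h) * ch // h
    cols = np.arange(w) * cw // w
    view = crop[rows][:, cols]
    if rng.random() < 0.5:  # horizontal flip with probability 0.5
        view = view[:, ::-1]
    return view

def positive_pair(img, rng):
    """Two independently augmented views of the same image."""
    return augment(img, rng), augment(img, rng)
```

The two returned views differ at the pixel level but share the same semantic content, which is exactly the property the SSL objectives above exploit.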
Other augmentations fall into three categories (Shorten and Khoshgoftaar, 2019; Jaiswal et al., 2021): 1) geometric augmentations include image rotations (Chen et al., 2020) or image translations (Shorten and Khoshgoftaar, 2019); 2) context-based augmentations include jigsaw puzzle augmentations (Noroozi and Favaro, 2016; Misra and Maaten, 2020), pairing images (Inoue, 2018), greyed stochastic/saliency-based occlusion (Fong and Vedaldi, 2019; Zhong et al., 2020), or automatically modifying the background (Ryali et al., 2021); 3) color-oriented transformations include the selection of color channels (Tian et al., 2020) or Gaussian noise (Chen et al., 2020). A related line of work proposes learning how to generate/select data-augmentations (Cubuk et al., 2019; Tian et al., 2020), but since it takes advantage of labels, the approach is no longer self-supervised.

Time-based data-augmentations. Several works have proposed using the temporality of interactions to learn visual representations. A recent line of work learns embeddings of video frames from the temporal contiguity of frames: Knights et al. (2021) propose a learning objective that makes the codes of adjacent frames within a video clip similar; however, the system still needs to know where each video starts and ends. In contrast, our setups expose the system to a continuous stream of visual inputs. Other methods have shown the importance of time-based augmentations based on videos for object tracking (Xu and Wang, 2021), category recognition (Gordon et al., 2020; Parthasarathy et al., 2022; Orhan et al., 2020), or adversarial robustness (Kong and Norcia, 2021). Unlike us, they do not provide an in-depth analysis of the impact of different kinds of natural interactions. Schneider et al. (2021) showed the importance of natural interactions with objects for learning object

