SELF-SUPERVISION THROUGH RANDOM SEGMENTS WITH AUTOREGRESSIVE CODING (RANDSAC)

Abstract

Inspired by the success of self-supervised autoregressive representation learning in natural language (GPT and its variants), and by recent advances in visual architecture design with Vision Transformers (ViTs), in this paper we explore the effect various design choices have on the success of applying such training strategies to visual feature learning. Specifically, we introduce a novel strategy that we call Random Segments with Autoregressive Coding (RandSAC). In RandSAC, we group patch representations (image tokens) into hierarchically arranged segments; within each segment, tokens are predicted in parallel, similar to BERT, while across segments predictions are sequential, similar to GPT. We illustrate that randomized serialization of the segments significantly improves performance and results in a distribution over spatially long (across-segment) and short (within-segment) predictions that is effective for feature learning. We illustrate the pertinence of these design choices and explore alternatives on a number of datasets (e.g., CIFAR10, CIFAR100, ImageNet). While our pre-training strategy works with a vanilla Transformer, we also propose a conceptually simple, but highly effective, addition to the decoder that allows learnable skip-connections to the encoder's feature layers, which further improves performance.

1. INTRODUCTION

Deep learning has powered enormous successes in computer vision and NLP over the past decade or so. It has led to significant improvements in object detection (Redmon et al., 2016), segmentation (He et al., 2017), as well as higher-level cognition tasks (e.g., Visual Question Answering (Antol et al., 2015), Visual Navigation (Mayo et al., 2021), etc.). These successes have been enabled both by advances in parallel hardware (GPUs) and, perhaps more importantly, by large-scale task-specific labeled datasets that allow supervised learning. This appetite for large data has, until very recently, stagnated progress, particularly in building general-purpose visual architectures. Such considerations date back to the early days of machine learning, and deep learning in particular, where it has long been postulated that unsupervised, or self-supervised, learning could enable robust and general feature representations that can then be readily used (or finetuned) for target tasks. Self-supervised learning has been explored in computer vision in various forms: denoising autoencoders (Pathak et al., 2016; Vincent et al., 2008), colorization (Zhang et al., 2016), or jigsaw-puzzle (Doersch et al., 2015; Noroozi & Favaro, 2016) proxy objectives. However, the success of such self-supervised pre-training was somewhat limited. In contrast, the success of similar self-supervised ideas in NLP has been far more pronounced, with GPT (Brown et al., 2020) and BERT (Devlin et al., 2018) architectures, and their variants, now enabling state-of-the-art performance on a wide array of natural language tasks. Recent advances in vision architectures, such as Vision Transformers (ViT) (Dosovitskiy et al., 2021; Liu et al., 2021), which serialize visual 2D data, have opened an opportunity to apply similar large-scale pre-training techniques in vision, with increasing success.
Self-supervised pre-training techniques with ViTs can be characterized into two broad categories, contrastive and predictive, as well as their combinations. In contrastive learning, architectures are pre-trained to be invariant to certain perturbations in the data (e.g., spatial shifts, color jitter) by forming positive and negative pairings of augmented data samples. This is a powerful technique, but it requires designers to make assumptions about the invariances that the architecture should learn. In addition, purely contrastive models tend to incorporate a center bias (Chen et al., 2022; 2021a), which makes them less transferable to tasks such as segmentation, where non-object-centric regions need to be modeled. Alternatively, predictive models learn to predict elements of the scene, either in parallel by reconstructing masked regions/tokens (Bao et al., 2022; He et al., 2021) (a.k.a. masked image modeling or BERT-style pre-training) or in an autoregressive, language-modeling manner (Chen et al., 2020a) (a.k.a. GPT-style pre-training). It is interesting to observe that on the NLP side GPT models have proven powerful, while vision models have gravitated towards BERT-style pre-training, with both visual (Chen et al., 2020a; Bao et al., 2022) and multi-modal data (Lu et al., 2019; Su et al., 2020).

Motivated by this, we adopt an autoregressive pre-training strategy (see Figure 1) and ask a number of important empirical questions about what makes such pre-training effective. Specifically: (1) What granularity (scale) and shape of tokens (patches, blobs) is most effective, and how does it affect performance? (2) How best to serialize predictions? For example, previous approaches, such as image GPT (Chen et al., 2020a), leveraged raster ordering. While such an ordering is perhaps "optimal" from correlation and predictive/generative (van den Oord et al., 2016) points of view, we show that it is not optimal for general feature learning. (3) Is deterministic or stochastic tokenization and serialization more helpful? Finally, (4) we explore effective interactions between the decoder and encoder layers, proposing a new ViT architecture that uses learned skip connections between encoder and decoder layers to improve performance.

Contributions. We make two core contributions. First, we propose a new pre-training strategy that leverages randomly sampled hierarchical segment traversals to autoregressively train ViT models. This allows both short- and long-range spatial predictions, inducing a distribution over easy and hard predictive tasks^1. We note that the effectiveness of single random segment inpainting was initially observed by Pathak et al. (2016), but it is notably missing from most recent self-supervised strategies. Our pre-training strategy generalizes this observation to hierarchical and serialized predictions.

Second, we propose a flexible ViT decoder in which each decoding layer learns to dynamically attend over different levels of features in the encoder. This in effect creates learned skip-connections, in contrast to UNet (Ronneberger et al., 2015) and similar designs that require fixed connections in a symmetric encoder-decoder, and further improves performance.

Discussion. The above pre-training strategy, while empirically motivated, is also loosely modeled after human vision. Humans attend to a scene through a sequence of foveal observations, in which the eye shifts over a series of fixation points; such motions are called saccades. Some saccades are long-range and voluntary, while others are local and involuntary (a.k.a. microsaccades (Rolfs, 2009)). Our segments can be "viewed" as predictive foveal regions, and the hierarchical serialization of such regions as a combination of micro- and macro-saccades. A significant difference is that human saccades are purposeful and have been shown to be conditioned on the task (Yarbus, 1967); in our pre-training, such "saccadic" movements are randomly sampled.

^1 This is, in part, motivated by He et al. (2021), who observe that in BERT-style pre-training a high masking ratio (as much as 75%), which corresponds to a harder predictive task, leads to better feature learning.

Figure 1: Randomized Autoregressive Segment Prediction. Illustration of our autoregressive segment prediction framework (RandSAC). RandSAC breaks the image into tokens, which are arranged into segments (here squares of size 2 × 2). An autoregressive (GPT-style) transformer-based model is then trained to predict segments in a randomly sampled serialization order. As a result, tokens within a segment are predicted in parallel, while segments themselves are predicted sequentially.
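To make the randomized segment serialization concrete, the following is a minimal sketch (not the authors' code) of how one might sample a segment order and build the corresponding attention mask: tokens may attend to all tokens in strictly earlier segments (plus themselves), so tokens within a segment are predicted in parallel while segments are predicted sequentially. The exact masking used in RandSAC may differ; this only illustrates the idea, with 1D segments standing in for the 2 × 2 spatial squares of Figure 1.

```python
import numpy as np

def random_segment_order(n_tokens, seg_size, rng):
    """Group token indices into contiguous segments, then shuffle segment order."""
    segs = [list(range(i, min(i + seg_size, n_tokens)))
            for i in range(0, n_tokens, seg_size)]
    rng.shuffle(segs)  # randomized serialization of segments
    return segs

def segment_causal_mask(segs, n_tokens):
    """mask[i, j] = True iff token i may attend to token j.

    A token attends to every token in a strictly earlier segment and to
    itself; tokens in the same segment do not see each other's targets,
    so they can be predicted in parallel (BERT-like within a segment,
    GPT-like across segments)."""
    rank = np.empty(n_tokens, dtype=int)  # serialization rank of each token's segment
    for r, seg in enumerate(segs):
        rank[seg] = r
    mask = rank[:, None] > rank[None, :]
    np.fill_diagonal(mask, True)
    return mask

rng = np.random.default_rng(0)
segs = random_segment_order(8, 2, rng)   # 4 segments of 2 tokens, shuffled
mask = segment_causal_mask(segs, 8)      # (8, 8) boolean attention mask
```

A token in the k-th predicted segment sees `seg_size * k + 1` positions, so harder (long-range) and easier (short-range) predictions are mixed by the random order.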


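One simple way to realize the learnable skip-connections described in the introduction is to let each decoder token form its own attention distribution over the stack of per-layer encoder features at its spatial position. The sketch below is a hypothetical, simplified numpy rendering of that idea (the paper's actual mechanism may differ); `Wk` is an assumed learnable projection.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_over_layers(dec_q, enc_feats, Wk):
    """Learned skip-connection: each decoder token attends over encoder layers.

    dec_q:     (T, D) decoder queries, one per token position
    enc_feats: (L, T, D) features from all L encoder layers
    Wk:        (D, D) learnable key projection (hypothetical)

    Returns a (T, D) mixture of per-layer encoder features, with a separate
    softmax over layers for every token -- i.e., a dynamic skip-connection,
    rather than the fixed symmetric links of a UNet."""
    keys = enc_feats @ Wk                                        # (L, T, D)
    scores = np.einsum('td,ltd->tl', dec_q, keys)                # (T, L)
    scores /= np.sqrt(dec_q.shape[-1])                           # scaled dot-product
    w = softmax(scores, axis=-1)                                 # weights over layers
    return np.einsum('tl,ltd->td', w, enc_feats)                 # (T, D)
```

Because the layer weights are a softmax (a convex combination), the output always lies in the span of the per-layer features at that position; different decoder depths can thereby specialize to shallow or deep encoder features as training dictates.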