SELF-SUPERVISED VIDEO REPRESENTATION LEARNING WITH CONSTRAINED SPATIOTEMPORAL JIGSAW

Abstract

This paper proposes a novel pretext task for self-supervised video representation learning by exploiting spatiotemporal continuity in videos. It is motivated by the fact that videos are spatiotemporal by nature, and a representation learned to detect spatiotemporal continuity/discontinuity is thus beneficial for downstream video content analysis tasks. A natural choice of such a pretext task is to construct spatiotemporal (3D) jigsaw puzzles and learn to solve them. However, this task turns out to be intractable. We thus propose Constrained Spatiotemporal Jigsaw (CSJ), whereby the 3D jigsaws are formed in a constrained manner to ensure that large continuous spatiotemporal cuboids exist in a shuffled clip, providing sufficient cues for the model to reason about the continuity. Instead of solving the constrained jigsaw puzzles directly, which could still be extremely hard, we carefully design four surrogate tasks that are more solvable yet still ensure that the learned representation is sensitive to spatiotemporal continuity at both the local and global levels. Extensive experiments show that our CSJ achieves state-of-the-art performance on two downstream tasks across various benchmarks.

1. INTRODUCTION

Self-supervised learning (SSL) has achieved tremendous successes recently for static images (He et al., 2020; Chen et al., 2020) and has been shown to outperform supervised learning on a wide range of downstream image understanding tasks. However, such successes have not yet been reproduced for videos. Since SSL models differ mostly in the pretext tasks employed on the unlabeled training data, designing pretext tasks more suitable for videos is the current focus of self-supervised video representation learning (Han et al., 2020; Wang et al., 2020). Videos are spatiotemporal data, and spatiotemporal analysis is the key to many video content understanding tasks. A good video representation learned from a self-supervised pretext task should therefore capture discriminative information jointly along both spatial and temporal dimensions. It is thus somewhat counter-intuitive to note that most existing SSL pretext tasks for videos do not explicitly require joint spatiotemporal video understanding. For example, some spatial pretext tasks have been borrowed from images without any modification (Jing et al., 2018), ignoring the temporal dimension. On the other hand, many recent video-specific pretext tasks typically involve speed or temporal order prediction (Lee et al., 2017; Wei et al., 2018; Benaim et al., 2020; Wang et al., 2020), i.e., operating predominantly along the temporal axis. A natural choice for a spatiotemporal pretext task is to solve 3D jigsaw puzzles, whose 2D counterpart has been successfully used for images (Noroozi & Favaro, 2016). Indeed, solving 3D puzzles requires the learned model to understand spatiotemporal continuity, a key step towards video content understanding. However, directly solving a 3D puzzle turns out to be intractable: a puzzle of 3×3×3 pieces (the same size as a Rubik's cube) can have 27! possible permutations. The volume of even a short video clip is much larger than that.
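The scale of the raw search space is easy to verify with a one-line arithmetic check (plain Python, purely illustrative):

```python
import math

# A 3x3x3 jigsaw has 27 pieces, so there are 27! possible orderings.
n_permutations = math.factorial(27)
print(n_permutations)  # roughly 1.09e28
```

Even the largest permutation sets handled by neural sorting models are many orders of magnitude smaller than this.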
Nevertheless, the latest neural sorting models (Paumard et al., 2020; Du et al., 2020) can only handle permutation sets a few orders of magnitude smaller, and thus offer no solution. This is hardly surprising because such a task is daunting even for humans: most people would struggle with a standard Rubik's cube, let alone a much larger one. In this paper, we propose a novel Constrained Spatiotemporal Jigsaw (CSJ) pretext task for self-supervised video representation learning. The key idea is to form 3D jigsaw puzzles in a constrained manner so that they become solvable. This is achieved by factorizing the permutations (shuffling) into the three spatiotemporal dimensions and then applying them sequentially. This ensures that for a given video clip, large continuous spatiotemporal cuboids exist after the constrained shuffling to provide sufficient cues for the model to reason about spatiotemporal continuity (see Fig. 1(b)(c)). Such large continuous cuboids are also vital for human understanding of video, as revealed in neuroscience and visual studies (Stringer et al., 2006; Chen et al., 2019). Even with the constrained puzzles, solving them directly could still be extremely hard. Consequently, instead of directly solving the puzzles (i.e., recovering the permutation matrix so that each piece can be put back), four surrogate tasks are carefully designed. They are more solvable but meanwhile still ensure that the learned representation is sensitive to spatiotemporal continuity at both the local and global levels. Concretely, given a video clip shuffled with our constrained permutations, we make sure that the top-2 largest continuous cuboids (LCCs) dominate the clip volume. The level of continuity in the shuffled clip as a whole is thus determined mainly by the volumes of these LCCs, and by whether they are in the right order (see Fig. 1(d)(e)) both spatially and temporally.
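As a concrete illustration of the factorized shuffling described above, the sketch below (plain Python; the grid size and piece representation are our own assumptions, not the paper's exact configuration) permutes a T×H×W grid of jigsaw pieces one axis at a time. For a 3×3×3 grid this shrinks the search space from 27! (about 10^28) joint orderings to 3!·3!·3! = 216 factorized ones.

```python
import random

def constrained_shuffle(grid, rng):
    """Shuffle a T x H x W grid of jigsaw pieces with a factorized
    permutation: one 1D permutation per axis, applied sequentially,
    instead of one joint permutation of all T*H*W pieces.
    `grid` is a nested list indexed as grid[t][h][w]."""
    T, H, W = len(grid), len(grid[0]), len(grid[0][0])
    pt = rng.sample(range(T), T)  # temporal permutation
    ph = rng.sample(range(H), H)  # vertical (height) permutation
    pw = rng.sample(range(W), W)  # horizontal (width) permutation
    shuffled = [[[grid[pt[t]][ph[h]][pw[w]] for w in range(W)]
                 for h in range(H)] for t in range(T)]
    return shuffled, (pt, ph, pw)

# Example: a 3x3x3 grid whose pieces are labeled 0..26.
grid = [[[t * 9 + h * 3 + w for w in range(3)] for h in range(3)]
        for t in range(3)]
shuffled, perms = constrained_shuffle(grid, random.Random(0))
```

Because frames, rows, and columns move only as whole slices, pieces that share a slice stay adjacent after shuffling, which is what makes large continuous cuboids survive the permutation.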
Our surrogate tasks are thus designed to locate these LCCs and predict their order so that the model learned with these tasks can be sensitive to spatiotemporal continuity both locally and globally. Our main contributions are three-fold: (1) We introduce a new pretext task for self-supervised video representation learning called Constrained Spatiotemporal Jigsaw (CSJ). To the best of our knowledge, this is the first work on self-supervised video representation learning that leverages spatiotemporal jigsaw understanding. (2) We propose a novel constrained shuffling method to construct easy 3D jigsaws containing large LCCs. Four surrogate tasks are then formulated in place of the original jigsaw solving task. They are much more solvable yet remain effective in learning spatiotemporally discriminative representations. (3) Extensive experiments show that our approach achieves state-of-the-art performance on two downstream tasks across various benchmarks.
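To make the surrogate objectives concrete, here is a hedged sketch (our own simplification, not the paper's implementation): under a factorized shuffle, a continuous segment corresponds to a run of consecutive original indices along an axis, so continuous cuboids can be located from the per-axis permutations alone, and the relative order of two such runs is "correct" exactly when their shuffled ordering matches their original ordering.

```python
def longest_consecutive_run(perm):
    """Return (start, length) of the longest run of consecutive,
    increasing original indices in a 1D permutation.
    e.g. [2, 0, 1] -> (1, 2): the run covering original pieces 0, 1."""
    best_start, best_len = 0, 1
    start, length = 0, 1
    for i in range(1, len(perm)):
        if perm[i] == perm[i - 1] + 1:
            length += 1
        else:
            start, length = i, 1
        if length > best_len:
            best_start, best_len = start, length
    return best_start, best_len

def order_preserved(run_a, run_b):
    """Each run is (shuffled_start, original_start). The relative order
    of two continuous segments is preserved iff their shuffled ordering
    matches their original ordering, as in Fig. 1(d) vs. Fig. 1(e)."""
    (sa, oa), (sb, ob) = run_a, run_b
    return (sa < sb) == (oa < ob)
```

In this simplified view, a 3D cuboid's volume is the product of the three per-axis run lengths, and the two pretext targets are the locations of the top-2 cuboids and whether `order_preserved` holds between them.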

2. RELATED WORK

Figure 1: Illustration of our constrained jigsaw and the surrogate pretext tasks using an image example (spatial only, for clarity); the construction extends directly to the spatiotemporal domain as done in this work. (a): The raw image. (b),(c): Comparing an unconstrained puzzle (b) with our constrained one (c), ours is clearly much more continuous (hence interpretable), as reflected by the size of the largest continuous cuboids (LCCs, rectangles in this image example) shown in red. (d),(e): Illustration of the importance of the relative order of the top-2 LCCs in determining the global continuity level of the shuffled image. (d) and (e) have the same top-2 LCCs, but only (d) keeps the correct relative order between them. Locating these LCCs and predicting their relative order are thus the key objectives of our surrogate tasks.

Self-supervised Learning with Pretext Tasks Self-supervised learning (SSL) typically employs a pretext task to generate pseudo-labels for unlabeled data via some form of data transformation. According to the transformations used by the pretext task, existing SSL methods for video representation learning can be divided into three categories: (1) Spatial-Only Transformations: Derived from the original image domain (Gidaris et al., 2018), Jing et al. (2018) leveraged spatial-only transformations for self-supervised video representation learning. (2) Temporal-Only Transformations: Misra et al. (2016); Fernando et al. (2017); Lee et al. (2017); Wei et al. (2018) obtained shuffled video frames with temporal-only transformations and then distinguished whether the shuffled frames are in chronological order. Xu et al. (2019) chose to shuffle video clips instead of frames. Benaim et al. (2020); Yao et al. (2020); Jenni et al. (2020) exploited the speed transformation by determining whether a video clip is accelerated. (3) Spatiotemporal Transformations: Only a few recent approaches (Ahsan et al., 2019; Kim et al., 2019) leveraged both spatial and temporal transformations by permuting 3D spatiotemporal cuboids. However, due to the aforementioned

