SELF-SUPERVISED VIDEO REPRESENTATION LEARNING WITH CONSTRAINED SPATIOTEMPORAL JIGSAW

Abstract

This paper proposes a novel pretext task for self-supervised video representation learning by exploiting spatiotemporal continuity in videos. It is motivated by the fact that videos are spatiotemporal by nature and a representation learned to detect spatiotemporal continuity/discontinuity is thus beneficial for downstream video content analysis tasks. A natural choice of such a pretext task is to construct spatiotemporal (3D) jigsaw puzzles and learn to solve them. However, this task turns out to be intractable. We thus propose Constrained Spatiotemporal Jigsaw (CSJ), whereby the 3D jigsaws are formed in a constrained manner to ensure that large continuous spatiotemporal cuboids exist in a shuffled clip to provide sufficient cues for the model to reason about continuity. Instead of solving the constrained jigsaw puzzles directly, which could still be extremely hard, we carefully design four surrogate tasks that are more solvable while still ensuring that the learned representation is sensitive to spatiotemporal continuity at both the local and global levels. Extensive experiments show that our CSJ achieves state-of-the-art performance on two downstream tasks across various benchmarks.

1. INTRODUCTION

Self-supervised learning (SSL) has achieved tremendous success recently for static images (He et al., 2020; Chen et al., 2020) and has been shown to outperform supervised learning on a wide range of downstream image understanding tasks. However, such successes have not yet been reproduced for videos. Since different SSL models differ mostly in the pretext tasks employed on the unlabeled training data, designing pretext tasks more suitable for videos is the current focus of self-supervised video representation learning (Han et al., 2020; Wang et al., 2020). Videos are spatiotemporal data, and spatiotemporal analysis is the key to many video content understanding tasks. A good video representation learned from a self-supervised pretext task should therefore capture discriminative information jointly along both the spatial and temporal dimensions. It is thus somewhat counter-intuitive that most existing SSL pretext tasks for videos do not explicitly require joint spatiotemporal video understanding. For example, some spatial pretext tasks have been borrowed from images without any modification (Jing et al., 2018), ignoring the temporal dimension. On the other hand, many recent video-specific pretext tasks typically involve speed or temporal order prediction (Lee et al., 2017; Wei et al., 2018; Benaim et al., 2020; Wang et al., 2020), i.e., they operate predominantly along the temporal axis. A natural choice for a spatiotemporal pretext task is to solve 3D jigsaw puzzles, whose 2D counterpart has been successfully used for images (Noroozi & Favaro, 2016). Indeed, solving 3D puzzles requires the learned model to understand spatiotemporal continuity, a key step towards video content understanding. However, directly solving a 3D puzzle turns out to be intractable: a puzzle of 3×3×3 pieces (the same size as a Rubik's cube) already has 27! possible permutations, and the volume of even a short video clip is much larger than that.
Nevertheless, the latest neural sorting models (Paumard et al., 2020; Du et al., 2020) can only handle permutation spaces a few orders of magnitude smaller, and so offer no solution. This is hardly surprising because such a task is daunting even for humans: most people would struggle with a standard Rubik's cube, let alone a much larger one. In this paper, we propose a novel Constrained Spatiotemporal Jigsaw (CSJ) pretext task for self-supervised video representation learning. The key idea is to form 3D jigsaw puzzles in a constrained manner so that the task becomes solvable. This is achieved by factorizing the permutations (shuffling) into the three spatiotemporal dimensions and then applying them sequentially. This ensures that for a given video clip, large continuous spatiotemporal cuboids exist after the constrained shuffling to provide sufficient cues for the model to reason about spatiotemporal continuity (see Fig. 1(b)(c)). Such large continuous cuboids are also vital for human understanding of video, as revealed in neuroscience and visual studies (Stringer et al., 2006; Chen et al., 2019). Even with the constrained puzzles, solving them directly could still be extremely hard. Consequently, instead of directly solving the puzzles (i.e., recovering the permutation matrix so that each piece can be put back), four surrogate tasks are carefully designed. They are more solvable but still ensure that the learned representation is sensitive to spatiotemporal continuity at both the local and global levels. Concretely, given a video clip shuffled with our constrained permutations, we make sure that the top-2 largest continuous cuboids (LCCs) dominate the clip volume. The level of continuity in the shuffled clip as a whole is thus determined mainly by the volumes of these LCCs and whether they are in the right order (see Fig. 1(d)(e)), both spatially and temporally.
Our surrogate tasks are thus designed to locate these LCCs and predict their order, so that a model trained with these tasks is sensitive to spatiotemporal continuity both locally and globally. Our main contributions are three-fold: (1) We introduce a new pretext task for self-supervised video representation learning called Constrained Spatiotemporal Jigsaw (CSJ). To the best of our knowledge, this is the first work on self-supervised video representation learning that leverages spatiotemporal jigsaw understanding. (2) We propose a novel constrained shuffling method to construct easy 3D jigsaws containing large LCCs. Four surrogate tasks are then formulated in place of the original jigsaw solving task. They are much more solvable yet remain effective for learning spatiotemporally discriminative representations. (3) Extensive experiments show that our approach achieves state-of-the-art performance on two downstream tasks across various benchmarks.

2. RELATED WORK

Self-supervised Learning with Pretext Tasks Self-supervised learning (SSL) typically employs a pretext task to generate pseudo-labels for unlabeled data via some form of data transformation. According to the transformations used by the pretext task, existing SSL methods for video representation learning can be divided into three categories: (1) Spatial-Only Transformations: Derived from the original image domain (Gidaris et al., 2018), Jing et al. (2018) leveraged spatial-only transformations for self-supervised video representation learning. (2) Temporal-Only Transformations: Misra et al. (2016); Fernando et al. (2017); Lee et al. (2017); Wei et al. (2018) obtained shuffled video frames with temporal-only transformations and then distinguished whether the shuffled frames are in chronological order. Xu et al. (2019) chose to shuffle video clips instead of frames. Benaim et al. (2020); Yao et al. (2020); Jenni et al. (2020) exploited the speed transformation by determining whether a video clip is accelerated. (3) Spatiotemporal Transformations: There are only a few recent approaches (Ahsan et al., 2019; Kim et al., 2019) that leveraged both spatial and temporal transformations by permuting 3D spatiotemporal cuboids. However, due to the aforementioned intractability of solving spatiotemporal jigsaw puzzles, they only leveraged either temporal or spatial permutations as training signals, i.e., they exploited the two domains independently. Therefore, no true spatiotemporal permutations have been considered in Ahsan et al. (2019); Kim et al. (2019). In contrast, given that both spatial appearance and temporal relations are important cues for video representation learning, the focus of this work is on investigating how to exploit spatial and temporal continuity jointly for self-supervised video representation learning.
To that end, our Constrained Spatiotemporal Jigsaw (CSJ) presents the first spatiotemporal-continuity-based pretext task for video SSL, thanks to a novel constrained 3D jigsaw and four surrogate tasks that reason about the continuity of the 3D jigsaw puzzles without solving them directly. Self-supervised Learning with Contrastive Learning Contrastive learning is another self-supervised learning approach that has become increasingly popular in the image domain (Misra & Maaten, 2020; He et al., 2020; Chen et al., 2020). Recently, it has been incorporated into video SSL as well. Contrastive learning and transformation-based pretext tasks are orthogonal to each other and often combined, in that different transformed versions of a data sample form the positive set used in contrastive learning. In El-Nouby et al. (2019); Knights et al. (2020); Qian et al. (2020); Wang et al. (2020); Yang et al. (2020), the positive/negative samples were generated based on temporal transformations only. In contrast, some recent works (Han et al., 2019; 2020; Zhuang et al., 2020) leveraged features from future frame embeddings or from a memory bank (Wu et al., 2018). They modeled spatiotemporal representations using only contrastive learning, without transformations. Contrastive learning is also exploited in one of our surrogate pretext tasks. Different from existing works, we explore spatiotemporal transformations in the form of CSJ and employ contrastive learning to distinguish different levels of spatiotemporal continuity in the shuffled jigsaws. This enables us to learn more discriminative spatiotemporal representations.

3.1. PROBLEM DEFINITION

The main goal of self-supervised video representation learning is to learn a video feature representation function f(•) without using any human annotations. A general approach to achieving this goal is to generate a supervisory signal y from an unlabeled video clip x and construct a pretext task P to predict y from f(x). The process of solving the pretext task P encourages f(•) to learn discriminative spatiotemporal representations. The pretext task P is typically constructed by applying to a video clip a transformation function t(•; θ) parameterized by θ and then automatically deriving y from θ; e.g., y can be the type of the transformation. Based on this premise, P is defined as the prediction of y using the feature map of the transformed video clip f(x̃), i.e., P: f(x̃) → y, where x̃ = t(x; θ). For example, in Lee et al. (2017), t(•; θ) denotes a temporal transformation that permutes the four frames of video clip x in a temporal order θ, x̃ = t(x; θ) is the shuffled clip, and the pseudo-label y is defined as the permutation order θ (e.g., 1324, 4312, etc.). The pretext task P is then a classification problem over 24 categories because there are 4! = 24 possible orders.
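As a toy illustration of this formulation, the temporal-order pretext of Lee et al. (2017) can be sketched as follows; the helper names are ours, not from the paper:

```python
import itertools
import random

# All 4! = 24 temporal orders of a 4-frame clip; each order is one class y.
ORDERS = sorted(itertools.permutations(range(4)))

def transform(x, theta):
    """t(x; theta): permute the four frames of clip x by order theta."""
    return [x[i] for i in theta]

def pseudo_label(theta):
    """Derive y from theta: the index of the permutation among all 24."""
    return ORDERS.index(tuple(theta))

frames = ["f0", "f1", "f2", "f3"]        # stand-in for real frames
theta = random.choice(ORDERS)
x_tilde, y = transform(frames, theta), pseudo_label(theta)
assert len(ORDERS) == 24 and 0 <= y < 24
```

A model f(•) would then be trained to map x_tilde to the 24-way class y.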

3.2. CONSTRAINED PERMUTATIONS

Solving spatiotemporal video jigsaw puzzles seems to be an ideal pretext task for learning discriminative representations, as it requires an understanding of spatiotemporal continuity. After shuffling the pixels in a video clip using a 3D permutation matrix, the pretext task is to recover the permutation matrix. However, as explained earlier, this task is intractable for even moderate video clip sizes. Our solution is to introduce constraints on the permutations. As a result, a new pretext task P_CSJ based on Constrained Spatiotemporal Jigsaw (see Fig. 2(a)) is formulated, which is much easier to solve than a random/unconstrained jigsaw. Specifically, our goal is to constrain the permutations so that the resultant shuffled video clip is guaranteed to have large continuous cuboids (see Fig. 2(a)). As for humans (Stringer et al., 2006), having large continuous cuboids is key for a model to understand a 3D jigsaw and therefore to have any chance of solving it. Formally, the volume of a shuffled video clip x̃ is denoted as {T, H, W}, measuring its sizes along the temporal, height, and width dimensions, respectively. A cuboid is defined as a crop of x̃: c = x̃_{t1:t2, h1:h2, w1:w2}, where t1, t2 ∈ {1, 2, ..., T}, h1, h2 ∈ {1, 2, ..., H}, and w1, w2 ∈ {1, 2, ..., W}. If all the jigsaw pieces (smallest video clip units, e.g., pixels or 3D pixel blocks) in c keep the same relative order as they had in x (before being shuffled), we call c a continuous cuboid c^cont. The cuboid's volume equals (t2 − t1) × (h2 − h1) × (w2 − w1), and the largest continuous cuboid (LCC) c^cont_max is the c^cont with the largest volume. We introduce two permutation strategies to ensure that the volumes of the LCCs are large in relation to the whole clip volume after our shuffling transformation t(•; θ_CSJ).
First, instead of shuffling x in the three spatiotemporal dimensions simultaneously, t(•; θ_CSJ) factorizes the permutations into the three spatiotemporal dimensions and then applies them sequentially to generate shuffled clips, e.g., in the order T, W, H, each only once. Note that the volume of the generated x̃ stays the same under different permutation orders (e.g., TWH and HTW). Second, we shuffle a group of jigsaw pieces together instead of each piece individually along each dimension. Taking spatial shuffling as an example, if there are 8 pieces per frame (along each of the two spatial dimensions), θ_CSJ could be represented as the permutation from {12345678} to {84567123}. The longest and second-longest continuous index ranges are then [2, 5] for pieces {4567} and [6, 8] for pieces {123}. With these two permutation strategies, not only do we obtain large LCCs, but they are also guaranteed to have clearly separable boundaries with the surrounding pieces (see Fig. 2(b)) due to the factorized and grouped permutation design. This means that they are easily detectable.
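The two strategies can be sketched as follows: a toy numpy version where the group counts and function names are ours, chosen for illustration only:

```python
import random
import numpy as np

def grouped_permutation(n_pieces, n_groups, rng):
    """Permute piece indices in contiguous groups: the groups are shuffled,
    but the within-group order is preserved (strategy 2)."""
    size = n_pieces // n_groups
    groups = [list(range(g * size, (g + 1) * size)) for g in range(n_groups)]
    rng.shuffle(groups)
    return [i for g in groups for i in g]

def constrained_shuffle(clip, n_groups, rng):
    """Apply one grouped permutation per dimension, sequentially
    (strategy 1: factorized over T, H, W rather than applied jointly)."""
    for axis in range(3):
        perm = grouped_permutation(clip.shape[axis], n_groups[axis], rng)
        clip = np.take(clip, perm, axis=axis)
    return clip

def longest_runs(perm):
    """Lengths of the two longest consecutive-index runs, i.e. the 1-D
    footprint of the top-2 continuous groups along one dimension."""
    runs, cur = [], 1
    for a, b in zip(perm, perm[1:]):
        if b == a + 1:
            cur += 1
        else:
            runs.append(cur)
            cur = 1
    runs.append(cur)
    return sorted(runs, reverse=True)[:2]

# The paper's example, 0-indexed: {1..8} -> {8,4,5,6,7,1,2,3}.
assert longest_runs([7, 3, 4, 5, 6, 0, 1, 2]) == [4, 3]
shuffled = constrained_shuffle(np.arange(4 * 8 * 8).reshape(4, 8, 8),
                               (2, 4, 4), random.Random(0))
assert shuffled.shape == (4, 8, 8)
```

Because each axis is permuted in contiguous groups, the runs found by `longest_runs` per axis intersect to form large continuous cuboids with clean boundaries.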

3.3. SURROGATE TASKS

Having permutation constraints preserves more spatiotemporal continuity in the shuffled clip and reduces the number of possible permutations. But exploiting these constraints to make a neural sorting model tractable is still far from trivial. Instead of solving the jigsaw directly, our P_CSJ is thus formulated as four surrogate tasks: Largest Continuous Cuboid Detection (LCCD), Clip Shuffling Pattern Classification (CSPC), Contrastive Learning over Shuffled Clips (CLSC), and Clip Continuity Measure Regression (CCMR). As illustrated in Fig. 2(b), given an unlabeled clip x, we first construct a mini-batch of 8 clips {x̃_1, x̃_2, ..., x̃_8} by shuffling x with different but related constrained permutations (to be detailed later). These shuffled clips and the raw clip x are then fed into a 3D CNN model f(•) for spatiotemporal representation learning with a non-local operation (Wang et al., 2018): f_NL(x̃_i) = NL(f(x̃_i), f(x)), where NL(•, •) denotes the non-local operator, and f(x̃_i) and f(x) denote the feature maps of x̃_i and x from the last convolutional layer of f(•), respectively. The resultant feature map f_NL(x̃_i) is further passed through a spatial pooling layer followed by a separate fully-connected layer for each surrogate task. Note that the raw video feature map f(x) is used as guidance through the non-local attention mechanism to help fulfill the tasks. This is similar to how humans need to see the completed jigsaw picture to help solve the puzzle. Before we detail the four tasks, we first explain how the eight permutations from the same raw clip are generated. First, the factorized and grouped permutations are applied to x to create one shuffled clip. By examining the largest and second-largest continuous piece index ranges in each dimension ({T, H, W}), we can easily identify the top-2 largest continuous cuboids (LCCs).
Next, by setting the relative order of the top-2 LCCs to either the correct (original) order or the reverse order in each dimension, 2×2×2 = 8 permutations are obtained. By controlling the group size in the permutation, we can make sure that the top-2 LCCs account for a large proportion, say 80%, of the total clip volume. Our four tasks are thus centered around these two LCCs, as they largely determine the overall spatiotemporal continuity of the shuffled clip. The first task, LCCD, is to locate the top-2 LCCs {c^cont_max(j) : j = 1, 2} and is formulated as a regression problem. Given a ground-truth LCC c^cont_max(j), a Gaussian kernel is applied to its center to depict the probability of each pixel in x̃ belonging to the LCC. This leads to a soft mask M^j_LCCD with the same size as x̃: M^j_LCCD is 0 everywhere outside the region of c^cont_max(j), and exp(−‖a − a_c‖² / (2σ_g²)) inside the region, where a and a_c denote any pixel and the center point, respectively, and σ_g is a hyper-parameter set to 1 empirically. In the training stage, an FPN (Lin et al., 2017) is used for multi-level feature fusion. LCCD is optimized using the MSE loss at each point: L_LCCD = Σ_{j∈{1,2}} Σ_{a∈x̃} MSE(M̂^j_LCCD(a), M^j_LCCD(a)), where MSE(•, •) denotes the MSE loss function and M̂^j_LCCD(a) is the prediction at pixel a. CSPC is designed to recognize the shuffling pattern of a shuffled clip. As mentioned earlier, the eight shuffled clips in each mini-batch are created from the same raw clip and differ only in the relative order of the top-2 LCCs along each of the three dimensions. There are thus eight permutations depending on the order (correct or reverse) in each dimension.
Based on this understanding, CSPC is formulated as a multi-class classification task that recognizes each shuffled clip as one of these eight classes, optimized with the Cross-Entropy (CE) loss: L_CSPC = Σ_{i∈{0,1,...,7}} CE(l̂_CSPC[i], l_CSPC[i]), where CE(•, •) denotes the CE loss function and l̂_CSPC[i] is the predicted class label of the i-th sample (shuffled clip) in each mini-batch. The two tasks above emphasize local spatiotemporal continuity understanding. In contrast, CLSC leverages a contrastive loss to encourage global continuity understanding. In particular, since the top-2 LCCs dominate the volume of a clip, it is safe to assume that if their relative order is correct in all three dimensions, the shuffled clip largely preserves continuity compared to the original clip, while the other 7 permutations feature large discontinuity in at least one dimension. We thus form a contrastive learning task with the original video x and the most continuous shuffled video x̃_i as a positive pair, and x and the rest x̃_j (j ≠ i) as negative pairs. CLSC is optimized using the Noise Contrastive Estimation (NCE) loss (Tian et al., 2020): L_CLSC = −log [ exp(sim(f(x), f(x̃_i))/τ) / ( exp(sim(f(x), f(x̃_i))/τ) + Σ_{j≠i} exp(sim(f(x), f(x̃_j))/τ) ) ], where sim(•, •) is defined as the dot product f(x)ᵀf(x̃_i), and τ is the temperature hyper-parameter. Note that the non-local operator is not used in CLSC. CCMR is similar to CLSC in that it also enforces global continuity understanding, but differs in that it is a regression task aimed at predicting a global continuity measure. We consider two such measures. Since the total size of the top-2 LCCs {c^cont_max(j) : j = 1, 2} is a good indicator of how continuous a shuffled video clip is, the first measure l_ld directly measures the relative total size of the top-2 LCCs: l_ld = (v(c^cont_max(1)) + v(c^cont_max(2))) / v(x̃), where v(•) represents the volume of a clip/cuboid.
The second measure l^{t/h/w}_hd examines the shuffling degree of x̃ in each dimension, computed as the normalized hamming distance: hamming(x̃) / (N_c(N_c − 1)/2), where hamming(•) denotes the hamming distance in each dimension between the original piece sequence and the permuted one, and N_c represents the number of pieces in that dimension, so that N_c(N_c − 1)/2 is the maximum possible distance in the dimension. CCMR is optimized using the Mean Squared Error (MSE) loss: L_CCMR = MSE([l̂_ld, l̂^t_hd, l̂^h_hd, l̂^w_hd], [l_ld, l^t_hd, l^h_hd, l^w_hd]), where l̂_ld, l̂^t_hd, l̂^h_hd, l̂^w_hd are the predictions of the model.
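The two CCMR continuity measures can be sketched in plain Python. Note one assumption on our part: since the normalizer N_c(N_c − 1)/2 equals the maximum number of discordant index pairs, we read the "hamming distance" as a pairwise inversion count; the paper may use a different definition:

```python
def lcc_volume_measure(v_lcc1, v_lcc2, v_clip):
    """l_ld: relative total volume of the top-2 LCCs."""
    return (v_lcc1 + v_lcc2) / v_clip

def shuffling_degree(perm):
    """l_hd for one dimension: discordant-pair (inversion) count of the
    piece permutation, normalised by the maximum N_c(N_c - 1)/2."""
    n = len(perm)
    inversions = sum(1 for i in range(n)
                       for j in range(i + 1, n) if perm[i] > perm[j])
    return inversions / (n * (n - 1) / 2)

assert shuffling_degree([0, 1, 2, 3]) == 0.0   # untouched dimension
assert shuffling_degree([3, 2, 1, 0]) == 1.0   # fully reversed dimension
assert lcc_volume_measure(40, 24, 80) == 0.8   # top-2 LCCs cover 80%
```

Both measures are in [0, 1], which makes them convenient joint regression targets for the MSE loss above.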

3.4. OVERALL LEARNING OBJECTIVE

Our entire CSJ framework is optimized end-to-end with the learning objective L = σ_1 L_LCCD + σ_2 L_CSPC + σ_3 L_CLSC + σ_4 L_CCMR, where σ_1, σ_2, σ_3, σ_4 denote the weights of the four losses. We deploy the adaptive weighting mechanism of Kendall et al. (2018) to weight these tasks, so there are no free hyper-parameters to tune. We also adopt curriculum learning (Bengio et al., 2009; Korbar et al., 2018) to train our network by shuffling clips from easy to hard. More details are presented in Appendices A.1 and A.2.
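A common form of the uncertainty-based adaptive weighting of Kendall et al. (2018) learns one log-variance per task; the sketch below is our reading of that mechanism, not necessarily the paper's exact implementation:

```python
import math

def adaptive_weighted_loss(losses, log_vars):
    """Uncertainty-based multi-task weighting: each task loss L_i is scaled
    by exp(-s_i)/2 plus an s_i/2 regulariser, where s_i = log(sigma_i^2) is
    a learnable parameter, so no loss weights need manual tuning."""
    return sum(math.exp(-s) * L / 2 + s / 2 for L, s in zip(losses, log_vars))

# Four task losses (LCCD, CSPC, CLSC, CCMR) with equal initial log-variances.
assert adaptive_weighted_loss([1.0, 1.0, 1.0, 1.0], [0.0] * 4) == 2.0
```

In training, the `log_vars` would be optimized jointly with the network parameters; the regulariser prevents the trivial solution of inflating every variance.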

4.1. DATASETS AND SETTINGS

We select three benchmark datasets for performance evaluation: UCF101 (Soomro et al., 2012), HMDB51 (Kuehne et al., 2011), and Kinetics-400 (K400) (Kay et al., 2017), containing 13K/7K/306K video clips from 101/51/400 action classes, respectively. In the self-supervised pre-training stage, we utilize the first training split of UCF101/HMDB51 and the training split of K400 without using their labels. As in Han et al. (2020), we adopt R2D3D as the backbone network, which is modified from R3D (Hara et al., 2018) and has fewer parameters. By fine-tuning the pre-trained model, we evaluate the SSL performance on a downstream task (i.e., action classification). Following Han et al. (2019); He et al. (2020), two evaluation protocols are used: comparisons against the state of the art follow the more popular fully fine-tuning evaluation protocol, while the ablation analysis uses both the linear evaluation and fully fine-tuning protocols. For the downstream supervised learning experiments, we report top-1 accuracy on the first test split of UCF101/HMDB51, as is standard (Han et al., 2020). More details of the datasets are provided in Appendix B.

4.2. IMPLEMENTATION DETAILS

Raw videos in these datasets are decoded at a frame rate of 24–30 fps. From each raw video, we start at a randomly selected frame index and sample a consecutive 16-frame video clip with a temporal stride of 4. For data augmentation, we first resize the video frames to 128×171 pixels, from which we extract random crops of size 112×112 pixels. We also apply random horizontal flipping and random color jittering to the video frames during training. We exploit only the raw RGB video frames as input and do not leverage optical flow or other auxiliary signals for self-supervised pre-training. We adopt the Adam optimizer with a weight decay of 10⁻³ and a batch size of 8 per GPU (with a total of 32 GPUs). We use a cosine-annealed learning rate with an initial value of 10⁻⁴ over 100 epochs. The jigsaw puzzle piece sizes along the {T, H, W} dimensions are set to 1, 4, and 4, respectively. A 16×112×112 video clip thus contains 16×28×28 pieces. We set the temperature hyper-parameter τ to 0.07. A dropout of 0.5 is applied to the final layer of each task. More implementation details of the fine-tuning and test evaluation stages can be found in Appendix B.
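The stated clip and piece sizes imply the following piece grid (a quick consistency check of the arithmetic above):

```python
# Piece grid implied by the stated sizes: a 16x112x112 clip divided into
# 1x4x4 pieces along the T, H, W dimensions.
clip_size = {"T": 16, "H": 112, "W": 112}
piece_size = {"T": 1, "H": 4, "W": 4}
grid = {k: clip_size[k] // piece_size[k] for k in clip_size}
assert (grid["T"], grid["H"], grid["W"]) == (16, 28, 28)
```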

Comparison in Action Recognition

A standard way to evaluate a self-supervised video representation learning model is to use it to initialize an action recognition model on a small dataset. Specifically, after self-supervised pre-training on UCF101/HMDB51/K400, we exploit the learned backbone for full fine-tuning on UCF101 and HMDB51, following Han et al. (2020); Wang et al. (2020). We consider one baseline: fully-supervised learning with pre-training on K400. Note that this baseline is commonly regarded as the upper bound of self-supervised representation learning (Alwassel et al., 2019). From Table 1, we make the following observations: (1) Our CSJ achieves state-of-the-art performance on both UCF101 and HMDB51. In particular, with the R2D3D-18 backbone, which is weaker than R(2+1)D-18, our CSJ performs comparably with Pace on UCF101 and achieves a 10% improvement over Pace on HMDB51. (2) By exploiting spatiotemporal transformations for self-supervised representation learning, our CSJ beats both methods with only temporal transformations (†) and methods with both spatial and temporal transformations (‡), as well as those learning spatiotemporal representations (*) via contrastive learning alone (without spatiotemporal transformations). (3) Our CSJ also outperforms CBT (Sun et al., 2019), which used roughly ten-times larger datasets (K600 (Carreira et al., 2018) + HowTo100M (Miech et al., 2019)) and multiple modalities (RGB+Audio). (4) Our CSJ comes closest to the fully-supervised baseline (upper bound), validating its effectiveness for self-supervised video representation learning.

Comparison in Video Retrieval

We evaluate our CSJ method on the video retrieval task. The results in Table 2 show that our method outperforms all other self-supervised methods and achieves a new state of the art in video retrieval on UCF101. In particular, our method beats the latest competitor PRP (Yao et al., 2020) on four out of five metrics. This indicates that our proposed CSJ is also effective for video representation learning in video retrieval.

4.4. FURTHER EVALUATIONS

Ablation Study We conduct ablative experiments to validate the effectiveness of the four CSJ surrogate tasks and two additional learning strategies. From Table 3, we observe that: (1) Self-supervised learning with each of the four tasks shows better generalization than fine-tuning the network from scratch (random initialization). (2) By training over all four tasks jointly, we achieve large performance gains (see '+LCCD' vs. 'CCMR'). (3) Each additional learning strategy (i.e., adaptive weighting or curriculum learning) boosts performance by a further 0.3–0.5%. (4) Our full model achieves a remarkable classification accuracy of 70.4%, demonstrating the effectiveness of our proposed CSJ with only the RGB video stream (without additional optical flow, audio, or text modalities). More ablative analysis can be found in Appendix D. Visualization of Attention Maps Fig. 3 visualizes the attention maps over the last feature maps of two models fine-tuned on UCF101 with and without our self-supervised pre-training. Since each frame's attention map involves four adjacent frames, it actually contains spatiotemporal semantic features. We can see that our self-supervised pre-training with CSJ indeed helps to better capture meaningful spatiotemporal information and thus recognize the action categories more correctly.

Visualization of LCCD Predictions

We also visualize the LCCD predictions of the pre-trained models in Fig. 4. We can observe that solving the LCCD task indeed enables the model to learn the locations of LCCs and understand spatiotemporal continuity, which is a key step towards video content understanding.

The fine-tuning settings are the same as those in the self-supervised pre-training stage, except that the total number of epochs is 300 and the initial learning rate is 10⁻³. We use a batch size of 64 per GPU and a total of 8 GPUs for fine-tuning. We follow the standard evaluation protocol (Han et al., 2020) during inference and use ten-crop sampling to take the same sequence length as in training from each video. The predicted label of each video is calculated by averaging the softmax probabilities of all clips in the video.

C NETWORK ARCHITECTURE

We deploy the same network backbone R2D3D as Han et al. (2019; 2020), which is a 3D-ResNet (R3D) similar to Hara et al. (2018). The only difference between R2D3D and R3D is that R2D3D keeps the first two residual blocks as 2D convolutional blocks while R3D uses 3D blocks. The modified R2D3D therefore has fewer parameters (only the last two blocks use 3D convolutions). We present the CNN structure of R2D3D in Table 4. As an alternative to predicting center points with the detection-based method, we also design a segmentation method, Largest Continuous Cuboid Segmentation (LCCS), to predict the locations of the top-2 LCCs {c^cont_max(j) : j = 1, 2}. The difference between LCCD and LCCS is that LCCS is formulated as a segmentation task that discriminates whether a pixel is in the region of c^cont_max(j). Concretely, LCCS predicts a binary mask M^j_LCCS in which only points inside the region of c^cont_max(j) are set to 1 and all others to 0. As a result, LCCS is optimized using the Cross-Entropy (CE) loss at each point:

D ADDITIONAL ABLATION STUDIES

L_LCCS = Σ_{j∈{1,2}} Σ_{a∈x̃} CE(M̂^j_LCCS(a), M^j_LCCS(a)), where CE(•, •) denotes the CE loss function and M̂^j_LCCS(a) is the predicted class of pixel a. We report the performance of four different designs of LCCD in Table 5: (1) LCCS: LCCS is used instead of LCCD. (2) LCCD + M_LCCS: the Gaussian mask M_LCCD is substituted by the binary mask M_LCCS, but the task is still optimized with the MSE loss. (3) LCCD + L1: the LCCD task is optimized with the L1 loss. (4) LCCD + MSE: the LCCD task is optimized with the MSE loss. From Table 5, it can be seen that the segmentation task also helps self-supervised representation learning but does not perform as well as LCCD. Moreover, among the three LCCD settings, the MSE loss with the Gaussian mask performs best. We also report the performance of three different designs of CCMR: (1) ld: the volume-based measure l_ld is used as supervision, which contains only volume information. (2) hd: the hamming distances l^t_hd, l^h_hd, l^w_hd are used, which contain only relative order information. (3) ld + hd: both ld and hd are used as supervision. From Table 8, we can see that: first, both ld and hd help the model to learn continuity characteristics during pre-training, and hd outperforms ld by a small margin; second, our CCMR learns the best representation by combining ld and hd.
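The two target-mask designs compared above (binary for LCCS, Gaussian for LCCD) can be sketched in numpy. The toy sizes, function names, and the isotropic 3-D Gaussian are our assumptions for illustration:

```python
import numpy as np

def binary_mask(shape, box):
    """LCCS target: 1 inside the LCC box, 0 elsewhere."""
    mask = np.zeros(shape)
    (t1, t2), (h1, h2), (w1, w2) = box
    mask[t1:t2, h1:h2, w1:w2] = 1.0
    return mask

def gaussian_mask(shape, box, sigma=1.0):
    """LCCD target: a Gaussian of the distance to the box centre inside
    the box, 0 outside (soft membership instead of a hard one)."""
    t, h, w = np.meshgrid(*(np.arange(s) for s in shape), indexing="ij")
    (t1, t2), (h1, h2), (w1, w2) = box
    ct, ch, cw = (t1 + t2 - 1) / 2, (h1 + h2 - 1) / 2, (w1 + w2 - 1) / 2
    d2 = (t - ct) ** 2 + (h - ch) ** 2 + (w - cw) ** 2
    return np.exp(-d2 / (2 * sigma ** 2)) * binary_mask(shape, box)

hard = binary_mask((4, 8, 8), ((0, 2), (0, 4), (0, 4)))
soft = gaussian_mask((4, 8, 8), ((0, 2), (0, 4), (0, 4)))
assert hard.sum() == 2 * 4 * 4                    # hard mask counts voxels
assert soft.max() <= 1.0 and soft[3, 7, 7] == 0   # soft mask peaks near centre
```

The Gaussian target decays with distance from the LCC centre, which may explain why it pairs better with the MSE loss than the hard binary target does.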

D.5 RESULTS OF DIRECTLY SOLVING CSJ

We also report the results of solving the CSJ task directly in Table 9. We randomly shuffle video clips into 4 × 4 × 4 jigsaw puzzles. To recognize the correct permutation, the model must solve a (4! × 4! × 4!)-way classification task in the pre-training stage. We compare the direct CSJ task with the joint LCCD+CCMR task under the same setting for a fair comparison, adopting linear evaluation to show the effectiveness of the different tasks. We can observe from the table that solving LCCD+CCMR jointly is more effective than solving CSJ directly.
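For reference, the size of that direct classification problem can be checked in two lines:

```python
import math

# A 4x4x4 jigsaw shuffled independently per dimension admits 4! orders
# per axis, i.e. a (4! x 4! x 4!)-way classification problem.
n_classes = math.factorial(4) ** 3
assert n_classes == 13824
```

Even under the factorized constraint, nearly 14K classes make direct sorting a far harder target than the surrogate LCCD+CCMR objectives.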

E TEMPORAL ACTION SEGMENTATION

To show the effectiveness of our CSJ on new downstream tasks, we apply the pre-trained model obtained by our CSJ to temporal action segmentation, which is more challenging than the conventional action recognition and retrieval tasks. Specifically, we compare our CSJ model with the latest competitor MemDPC (Han et al., 2020) on the Breakfast dataset (Kuehne et al., 2014). For a fair comparison, our CSJ model and the MemDPC model adopt the same R2D3D-34 backbone. Due to time constraints, we only use a small subset of 200 long videos from the original Breakfast dataset as the training set for fine-tuning, and select a few long videos for testing. For temporal action segmentation, we follow the overall framework of MS-TCN (Abu Farha & Gall, 2019) but change its backbone to R2D3D-34 pre-trained by our CSJ or MemDPC. We present qualitative results on two test videos in Fig. 5. We can clearly observe that our CSJ outperforms MemDPC on both test videos. In particular, the predictions of our CSJ are much closer to the ground truth, while MemDPC tends to produce unwanted segments: it wrongly recognizes the segment (colored yellow) in the middle part of the first video as 'Pour Milk', and the segment (colored black) in the last part of the second video as 'Stir Coffee'. In conclusion, compared to the latest self-supervised video representation learning method MemDPC, our CSJ learns more robust features for temporal action segmentation thanks to its 'true' spatiotemporal jigsaw understanding.



CONCLUSION

We have introduced a novel self-supervised video representation learning method named Constrained Spatiotemporal Jigsaw (CSJ). By introducing constrained permutations, our proposed CSJ is the first to leverage spatiotemporal jigsaws in self-supervised video representation learning. We also proposed four surrogate tasks based on our constrained spatiotemporal jigsaws; they are designed to encourage a video representation model to understand spatiotemporal continuity, a key building block of video content analysis. Extensive experiments validated the effectiveness of each of the four CSJ tasks and showed that our approach achieves the state-of-the-art on two downstream tasks across various benchmarks.



Figure 1: Illustration of our constrained jigsaw and the surrogate pretext tasks using an image example (spatial only, for clarity); our constrained jigsaw extends straightforwardly to the spatiotemporal domain as done in this work. (a): The raw image. (b),(c): Comparing an unconstrained puzzle (b) with our constrained one (c), ours is clearly much more continuous (hence interpretable), as reflected by the size of the largest continuous cuboids (LCCs; rectangles in these image examples) shown in red. (d),(e): Illustration of the importance of the relative order of the top-2 LCCs for determining the global continuity level of the shuffled image. (d) and (e) have the same top-2 LCCs, but only (d) keeps the correct relative order between them. Locating these LCCs and predicting their relative order are thus the key objectives of our surrogate tasks.

Figure 2: (a) Illustration of our Constrained Spatiotemporal Jigsaw (CSJ) (see Sec. 3.2). (b) The pipeline of our proposed framework for self-supervised video representation learning (see Sec. 3.3). A raw video clip is transformed into 8 shuffled clips with our Constrained Spatiotemporal Jigsaw (CSJ), and a 3D CNN sharing weights extracts the feature representations from them. The model is then trained by solving four self-supervised tasks jointly.

Following Xu et al. (2019), we extract each video clip's embedding with the pre-trained model and use each clip in the test set to query the k nearest clips in the training set. The comparative results in Table
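The retrieval protocol described above can be sketched as follows. This is our own illustration: the function name is hypothetical, and cosine similarity over L2-normalised embeddings is an assumption (any distance over the clip embeddings fits the same pipeline).

```python
import numpy as np

def topk_recall(test_emb, test_labels, train_emb, train_labels,
                ks=(1, 5, 10, 20, 50)):
    """Top-k recall for nearest-neighbour clip retrieval.

    Each test embedding queries the training set; a query counts as a
    hit if any of its k nearest training clips shares its class label.
    """
    # Cosine similarity via dot products of L2-normalised embeddings.
    q = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    g = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    sim = q @ g.T                          # (num_test, num_train)
    order = np.argsort(-sim, axis=1)       # nearest neighbours first
    recalls = {}
    for k in ks:
        hits = (train_labels[order[:, :k]] == test_labels[:, None]).any(axis=1)
        recalls[k] = hits.mean()
    return recalls
```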

Figure 3: Attention visualization of the last feature maps from the fine-tuned models on UCF101. The first row denotes the raw frames from videos, and the last two rows correspond to fine-tuning from random initialization and from our self-supervised pre-trained model, respectively.

Figure 4: Visualization of the LCCD predictions from the pre-trained models. Each row denotes the frames at time stamp = (0, 4, 8, 12) from one video clip. (a) raw frames (with color jittering); (b) shuffled frames; (c) the ground truth of LCCD; (d) network's prediction.

Figure 5: Qualitative results for the temporal action segmentation task on the Breakfast dataset. Note that the notation ∅ denotes an unannotated segment in the ground truth.

Comparison to the state-of-the-art on UCF101(U) and HMDB51(H). All models are pretrained with the RGB modality only. †: Methods with temporal-only transformations. ‡: Methods with both spatial and temporal transformations. * : Methods that leverage spatiotemporal representations. HT: HowTo100M. The underline represents the second-best result.

Comparison with state-of-the-art self-supervised learning methods for nearest neighbor video retrieval (top-k recall) on UCF101. The underline represents the second-best result.

Evaluation of pre-training tasks with the backbone R2D3D-18 under linear probe and fully fine-tuning protocols on UCF101. AW: Adaptive Weighting. CL: Curriculum Learning.



The structure of the encoding function f (•). R2D3D-18 is used as an example.

Evaluation of pre-training tasks under different designs of LCCD on UCF101.

Evaluation of different temperatures τ for CLSC on UCF101.

The table above shows the accuracies obtained with different temperatures τ used in contrastive learning. We can observe that: (1) when τ is in the range 1 to 0.07, the accuracy increases as τ decreases; (2) when τ is large (e.g., 1), the accuracy drops considerably. In this work, τ is set to 0.07.
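For reference, the role of τ can be seen in a minimal NumPy sketch of an InfoNCE-style contrastive objective. This does not reproduce CLSC's exact loss; the function and variable names are our own, and the positive is assumed to sit at logit index 0.

```python
import numpy as np

def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE-style loss with temperature tau.

    Dividing the cosine similarities by a smaller tau sharpens the
    softmax distribution and penalises hard negatives more strongly.
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    q, pos, neg = l2norm(query), l2norm(positive), l2norm(negatives)
    l_pos = np.sum(q * pos, axis=1, keepdims=True)   # (B, 1)
    l_neg = np.einsum('bd,bnd->bn', q, neg)          # (B, N)
    logits = np.concatenate([l_pos, l_neg], axis=1) / tau
    # Cross-entropy with the positive at index 0 (log-sum-exp trick).
    logits -= logits.max(axis=1, keepdims=True)
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob_pos.mean()
```

Note that τ = 0 is ill-defined here (division by zero), which is why the sweep bottoms out at a small positive value such as 0.07.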

Evaluation of different designs of CSPC on UCF101. (1) 2 Categories: the shuffled clip is classified by whether it has the same relative order of the top-2 LCCs as the raw clip; this is almost the same as CLSC but is optimized with the CE loss. (2) 4 Categories: the shuffled clip is classified by how it differs from the raw clip: no difference, spatial-only difference, temporal-only difference, or spatiotemporal difference. From Table 7, we can see that CSPC with 8 categories outperforms the other two designs. These results support our motivation for leveraging spatiotemporal transformations.
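The 4-category labelling can be derived directly from the per-axis jigsaw permutations. The helper below is our own illustration (the paper's exact 8-category construction is not reproduced here); it only assumes that a clip is shuffled by one permutation per axis.

```python
def cspc_category_4way(perm_t, perm_h, perm_w):
    """4-way CSPC label from the per-axis jigsaw permutations.

    Categories follow the ablation above: 0 = no difference,
    1 = spatial-only, 2 = temporal-only, 3 = spatiotemporal.
    """
    def is_identity(p):
        return all(i == v for i, v in enumerate(p))

    temporal_shuffled = not is_identity(perm_t)
    spatial_shuffled = not (is_identity(perm_h) and is_identity(perm_w))
    if not temporal_shuffled and not spatial_shuffled:
        return 0
    if spatial_shuffled and not temporal_shuffled:
        return 1
    if temporal_shuffled and not spatial_shuffled:
        return 2
    return 3
```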

Evaluation of different designs of CCMR on UCF101.

Evaluation of pre-training tasks with the backbone R2D3D-18 under the linear evaluation protocol on UCF101. For computational efficiency, CSJ is only defined on 4 × 4 × 4 cells. Compared settings: Random Initialization; CSJ; LCCD+CCMR (4 × 4 × 4); LCCD+CCMR (16 × 28 × 28).



A ADDITIONAL LEARNING STRATEGIES

A.1 ADAPTIVE WEIGHTING

Formally, our CSJ has two continuous outputs y_1, y_4 from LCCD and CCMR, and two discrete outputs y_2, y_3 from CSPC and CLSC, modeled with Gaussian likelihoods and softmax likelihoods, respectively. The joint loss for these four tasks is:

L(W, σ_1, σ_2, σ_3, σ_4) = (1/2σ_1²)L_1(W) + (1/σ_2²)L_2(W) + (1/σ_3²)L_3(W) + (1/2σ_4²)L_4(W) + log σ_1σ_2σ_3σ_4,

where each σ_i is a weight factor learned automatically by the network. For a continuous output y, the log likelihood is defined with a Gaussian:

log p(y | f^W(x), σ) ∝ −(1/2σ²)‖y − f^W(x)‖² − log σ,

and for a discrete output the softmax likelihood is scaled by 1/σ² analogously.

A.2 CURRICULUM LEARNING

We adopt curriculum learning (Korbar et al., 2018) to train our network by shuffling clips from easy to hard. Let d be the shuffle degree of a shuffled clip x, i.e., the number of continuous cuboids in each dimension. We gradually increase d from 3 to 5 during training to produce more heavily permuted clips. Note that when the video content is ambiguous in one dimension, e.g., a static video clip inflated from an image, there is no temporal variance from which to learn the transformation; Kim et al. (2019) and Noroozi & Favaro (2016) also mention this problem as similar-looking ambiguity. To solve it, we calculate the variance along each dimension and set a threshold: if the variance is lower than the threshold, we decrease d from 3 to 1 so that the pieces are not shuffled along the corresponding dimension.

B.1 DATASETS

Kinetics-400 (K400) (Kay et al., 2017) is a very large action recognition dataset consisting of 400 human action classes and around 306k videos. In this work, we use the training split of K400 as the pre-training dataset.
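The adaptive weighting of Sec. A.1 can be sketched as a small PyTorch module in the style of the uncertainty weighting of Kendall et al. (2018). This is our own illustration, not the paper's code: the module name is hypothetical, and learning log σ_i (rather than σ_i directly) is an assumption made for numerical stability.

```python
import torch
import torch.nn as nn

class AdaptiveWeighting(nn.Module):
    """Uncertainty-based weighting of the four CSJ task losses.

    Continuous tasks (LCCD, CCMR) get the Gaussian 1/(2*sigma^2)
    factor; discrete tasks (CSPC, CLSC) get the softmax 1/sigma^2
    factor. log(sigma_i) is learned directly, so sigma_i > 0 always.
    """
    def __init__(self):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.zeros(4))  # sigma_1..sigma_4

    def forward(self, l_lccd, l_cspc, l_clsc, l_ccmr):
        s = self.log_sigma
        # exp(-2*log sigma) = 1/sigma^2; s.sum() = log(prod sigma_i).
        return (0.5 * torch.exp(-2 * s[0]) * l_lccd
                + torch.exp(-2 * s[1]) * l_cspc
                + torch.exp(-2 * s[2]) * l_clsc
                + 0.5 * torch.exp(-2 * s[3]) * l_ccmr
                + s.sum())
```

Because the σ_i are trainable parameters of the joint loss, the optimizer balances the four tasks automatically instead of relying on hand-tuned weights.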

B.2 IMPLEMENTATION DETAILS

In the fine-tuning stage, weights of convolutional layers are initialized with self-supervised pretraining, but weights of fully-connected layers are randomly initialized. The whole network is then trained with the cross-entropy loss. The pre-processing and training strategies are the same as in the

