TEMPCLR: TEMPORAL ALIGNMENT REPRESENTATION WITH CONTRASTIVE LEARNING

Abstract

Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to its paired video clips in a common feature space. For long videos, given a paragraph of description whose sentences describe different segments of the video, matching all sentence-clip pairs aligns the paragraph and the full video implicitly. However, such unit-level comparison may ignore global temporal context, which inevitably limits the generalization ability. In this paper, we propose a contrastive learning framework, TempCLR, to compare the full video and the paragraph explicitly. As the video/paragraph is formulated as a sequence of clips/sentences, under the constraint of their temporal order, we use dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs as the sequence-level distance. To explore the temporal dynamics, we break the consistency of temporal succession by shuffling video clips w.r.t. temporal granularity. We thereby obtain representations for clips/sentences that encode temporal information and thus facilitate sequence alignment. Beyond pre-training on videos and paragraphs, our approach also generalizes to matching between video instances. We evaluate our approach on video retrieval, action step localization, and few-shot action recognition, and achieve consistent performance gains over all three tasks. Detailed ablation studies are provided to justify the approach design.
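The negative construction by shuffling clips w.r.t. temporal granularity can be illustrated with a minimal sketch. This is an independent illustration, not the authors' code: `shuffle_clips` and `granularity` are hypothetical names, and a video is represented simply as a list of clip objects. Consecutive clips are grouped into segments of the chosen granularity, and the segments are permuted so that temporal succession is broken while each clip's content is kept intact.

```python
import random

def shuffle_clips(clips, granularity, rng=random):
    """Build a temporal negative: group consecutive clips into segments of
    `granularity` clips, then permute the segments (hypothetical sketch)."""
    # Split the clip sequence into contiguous segments.
    segments = [clips[i:i + granularity] for i in range(0, len(clips), granularity)]
    order = list(range(len(segments)))
    # Re-shuffle until the segment order actually differs from the original
    # (a single segment is returned unchanged).
    while len(order) > 1 and order == sorted(order):
        rng.shuffle(order)
    # Flatten the permuted segments back into a clip sequence.
    return [clip for k in order for clip in segments[k]]
```

A coarser granularity permutes longer chunks, perturbing only high-level event order; granularity 1 scrambles individual clips, destroying all local succession as well.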

1. INTRODUCTION

Representation learning on videos has achieved success (Goroshin et al., 2015; Feichtenhofer et al., 2021) in detecting actions over short periods. Recent work has extended it to video-text data (Miech et al., 2019; Radford et al., 2021) to learn a common feature space for zero-shot transfer. In particular, the understanding of long videos given a paragraph of description is increasingly important and may facilitate AI-assistant applications (Grauman et al., 2022; Lin et al., 2022; Chen et al., 2022). A long video is usually formulated as a sequence of short video clips. Given a paragraph, each sentence describes, i.e., is paired with, the consecutive video clips in a video segment. By matching all sentence-clip pairs (Miech et al., 2020), the full video and the paragraph can be aligned implicitly. However, maximizing the agreement between clips and sentences individually (unit-level) ignores the context of temporal dynamics, which limits generalization (Goyal et al., 2017). After all, within one video segment, as the progress of the action/event varies at each clip, the similarity between the clips and the sentence is naturally different. As such, strictly aligning the sentence with all paired clips, which serves as a hard label, may not always yield an optimal solution.

To incorporate the temporal correlation across clips, Xu et al. (2021) propose to first fuse the representations over a short period for sentences and video clips separately, and then align the fused representations. However, such methods incorporate only local temporal information and still do not model the global temporal correlation. Since a paragraph is essentially a sequence of sentences, as shown in Fig. 1, the whole long video and the paragraph should be explicitly compared and aligned (sequence-level). For a video consisting of multiple steps, e.g., an instructional video, the

† Equal contribution. Code Link: https://github.com/yyuncong/TempCLR
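The sequence-level comparison sketched above (dynamic time warping over sentence-clip costs, as named in the abstract) can be written as a short reference implementation. This is a generic DTW sketch under our own assumptions, not the authors' code: clips and sentences are assumed to be given as plain embedding vectors, and the pairwise cost is taken to be one minus cosine similarity.

```python
import math

def cosine_distance(u, v):
    """Pairwise cost of matching a clip embedding and a sentence embedding:
    1 - cosine similarity (assumed cost; the paper's exact cost may differ)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def dtw_distance(clips, sentences):
    """Minimum cumulative sentence-clip matching cost, accumulated while
    respecting the temporal order of both sequences."""
    n, m = len(clips), len(sentences)
    INF = float("inf")
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = cosine_distance(clips[i - 1], sentences[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j],      # advance clip only
                                   acc[i][j - 1],      # advance sentence only
                                   acc[i - 1][j - 1])  # advance both
    return acc[n][m]
```

Because the warping path is monotone, a video whose clips appear in the same order as the describing sentences accumulates a lower cost than one with the same clip contents in a shuffled order, which is exactly the signal the shuffled negatives exploit.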

