DISCOVERING INFORMATIVE AND ROBUST POSITIVES FOR VIDEO DOMAIN ADAPTATION

Abstract

Unsupervised domain adaptation for video recognition is challenging because the domain shift involves both spatial variations and temporal dynamics. Previous works have focused on contrastive learning for cross-domain alignment. However, limited variation among intra-domain positives, false cross-domain positives, and false negatives prevent contrastive learning from achieving intra-domain discrimination and cross-domain closeness. This paper presents a non-contrastive learning framework for unsupervised video domain adaptation that does not rely on negative samples. To address the limited variation among intra-domain positives, we treat unlabeled target videos as anchors and mine "informative intra-domain positives" in the form of spatial/temporal augmentations and target nearest neighbors (NNs). To tackle the false cross-domain positives caused by noisy pseudo-labels, we reverse the roles, treating source videos as anchors and sampling synthesized target videos from an estimated target distribution as "robust cross-domain positives," which are naturally more robust to pseudo-label noise. Extensive experiments on several cross-domain action recognition benchmarks demonstrate that our approach outperforms state-of-the-art methods.

1. INTRODUCTION

Recent breakthroughs in deep neural networks have transformed numerous computer vision tasks, including image and video recognition (He et al., 2016; Carreira & Zisserman, 2017a; Mittal et al., 2020). Nevertheless, achieving such remarkable results typically requires time-consuming human annotation. To address this issue, semi-supervised learning (Miyato et al., 2018) and self-supervised learning (SSL) (He et al., 2020) have been studied to leverage the knowledge in a dataset with abundant labeled samples to improve models trained on datasets with scarce labels. However, a domain shift between the source and target datasets usually exists in real-world scenarios, leading to performance degradation. Unsupervised domain adaptation (UDA) has been explored to transfer knowledge across datasets with domain discrepancies and mitigate this problem. Although many methods have been developed for images, UDA for videos remains largely underexplored. Recently, some studies have attempted UDA for video action recognition through direct alignment of frame/clip-level features (Chen et al., 2019a; Pan et al., 2020a; Choi et al., 2020). However, these methods usually extend image-based UDA methods without considering long-term temporal information or action semantics. Song et al. (2021) and Kim et al. (2021b) alleviate these issues with contrastive learning, learning such long-term spatial-temporal representations via instance discrimination. To further understand how contrastive learning helps UDA, we first recall that domain-wise discrimination and class-wise closeness are the two main criteria for solving UDA problems (Shi & Sha, 2012; Tang et al., 2020). Considering an unlabeled video from the target domain as an anchor, we explain that Song et al. (2021) introduce intra-domain positives (in the form of cross-modal correspondence, e.g., optical flow) to help UDA by learning discriminative representations in the target domain. Additionally, Kim et al. (2021b) utilize cross(source)-domain positives with the help of pseudo-labels, promoting class-wise closeness by pulling samples of the same class but different domains closer.
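To make the notion of a negative-free, nearest-neighbor objective concrete, the toy sketch below (our own illustration with hypothetical names, not the paper's actual implementation) pulls each anchor feature toward its nearest neighbor in a bank of target features, with no repulsion term on negatives:

```python
import numpy as np

def noncontrastive_nn_loss(anchors, bank):
    """Toy negative-free loss: pull each anchor toward its nearest
    neighbor in a target feature bank.  `anchors` is (B, D), `bank`
    is (N, D).  Illustrative only; names are not the paper's API."""
    # L2-normalize rows so dot products become cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sim = a @ b.T                 # (B, N) cosine similarities
    nn_idx = sim.argmax(axis=1)   # index of each anchor's nearest neighbor
    positives = b[nn_idx]         # treated as constants (stop-gradient analogue)
    # BYOL-style distance 2 - 2*cos; no negative samples appear anywhere.
    return float(np.mean(2.0 - 2.0 * np.sum(a * positives, axis=1)))
```

In a training loop one would minimize this quantity with respect to the anchor encoder only; the absence of negatives is what distinguishes such objectives from InfoNCE-style contrastive losses.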

