DISCOVERING INFORMATIVE AND ROBUST POSITIVES FOR VIDEO DOMAIN ADAPTATION

Abstract

Unsupervised domain adaptation for video recognition is challenging because the domain shift includes both spatial variations and temporal dynamics. Previous works have explored contrastive learning for cross-domain alignment. However, limited variations in intra-domain positives, false cross-domain positives, and false negatives hinder contrastive learning from fulfilling intra-domain discrimination and cross-domain closeness. This paper presents a non-contrastive learning framework that does not rely on negative samples for unsupervised video domain adaptation. To address the limited variations in intra-domain positives, we set unlabeled target videos as anchors and mine "informative intra-domain positives" in the form of spatial/temporal augmentations and target nearest neighbors (NNs). To tackle the false cross-domain positives caused by noisy pseudo-labels, we reverse the anchor role: we set source videos as anchors and sample synthesized target videos as "robust cross-domain positives" from an estimated target distribution, which are naturally more robust to pseudo-label noise. Extensive experiments on several cross-domain action recognition benchmarks demonstrate that our approach outperforms state-of-the-art methods.

1. INTRODUCTION

Recent breakthroughs in deep neural networks have transformed numerous computer vision tasks, including image and video recognition (He et al., 2016; Carreira & Zisserman, 2017a; Mittal et al., 2020). Nevertheless, achieving such remarkable results typically requires time-consuming human annotation. To address this issue, semi-supervised learning (Miyato et al., 2018) and self-supervised learning (SSL) (He et al., 2020) have been studied to exploit the knowledge available in a dataset with abundant labeled samples to improve models trained on datasets with scarce labeled data. However, a domain shift between the source and target datasets usually exists in real-world scenarios, leading to performance degradation. Unsupervised domain adaptation (UDA) transfers knowledge across datasets with domain discrepancies to mitigate this problem. Although many UDA methods have been developed for images, UDA for videos remains largely under-explored. Recently, some studies have performed UDA for video action recognition through direct alignment of frame/clip-level features (Chen et al., 2019a; Pan et al., 2020a; Choi et al., 2020). However, these methods usually extend image-based UDA methods without considering long-term temporal information or action semantics. Song et al. (2021) and Kim et al. (2021b) alleviate these issues with contrastive learning, which learns such long-term spatial-temporal representations by instance discrimination. To further understand how contrastive learning helps UDA, we first recall that domain-wise discrimination and class-wise closeness are the two main criteria for solving UDA problems (Shi & Sha, 2012; Tang et al., 2020). Considering an unlabeled video from the target domain as an anchor, we explain that Song et al. (2021) introduces intra-domain positives (in the form of cross-modal correspondence, e.g., optical flow) to help UDA by learning discriminative representations in the target domain. Additionally, Kim et al. (2021b) utilizes cross(source)-domain positives with the help of pseudo-labels and benefits class-wise closeness by pulling samples of the same class but different domains closer.

By thoroughly studying the effect of positive and negative selection in the existing contrastive-based methods in Figure 1(a), we empirically find that limited variations in intra-domain positives, pseudo-label noise in cross-domain positives, and false negatives are the three issues that largely hinder performance. Explicitly, when negatives are selected based on the ground truth, using intra-domain positives from self-correspondence (Song et al., 2021) is 3.5% below using positives from the ground truth. Further, selecting cross-domain positives based on pseudo-labels induces a 2% drop compared to the ground truth. Importantly, when negatives are selected from either different instances (Song et al., 2021) or pseudo-labels (Kim et al., 2021b), the performance of the four baselines in Figure 1(a) drops by 2∼4%. Besides, the purity analysis in Figure 1(b) also indicates that the pseudo-labels of target videos are noisy and thus unreliable for selecting cross-domain positives and negatives. Based on these observations, several questions naturally arise: 1. Are there any unexplored intra-domain positives that can enrich intra-domain variations? 2. How can the pseudo-label noise issue be alleviated when selecting cross-domain positives? 3. How can the adverse effect of false negatives be addressed? To answer the first question, we propose to leverage temporal and spatial augmentations of the unlabeled target video as intra-domain positives. The rationale is that partially modifying the spatial/temporal information of a video can alter the sample without changing the overall action semantics; incorporating such samples as positives helps the model become invariant to spatial/temporal shifts. Going a step further, we also explore the anchor's nearest neighbors (NNs) in the target feature space, which capture rich target-domain variations.

By analyzing the purity of the NNs in both the source and target domains in Figure 1(b), we observe that target-domain NNs are clean and thus fit as intra-domain positives. To address the second question, we are motivated by Xie et al. (2018): although the pseudo-labels assigned to target samples are noisy, the class-conditional centroids µ_c^t of the target features weaken this noise by averaging all target features sharing the same pseudo class. To incorporate this into our contrastive learning framework, we reverse the formulation by setting source videos as anchors and then mining target videos as cross-domain positives. We estimate a Gaussian target distribution N(µ_c^t, Σ_c^t) of the target features based on pseudo-labels. Given a source video as an anchor, synthesized target features can be drawn from the distribution of the class shared with the source anchor. Consequently, we can leverage those synthesized features as cross-domain positives, since the estimated target distribution is robust to pseudo-label noise. To tackle the last question, we present an effective non-contrastive learning framework that does not rely on negative samples for video domain adaptation. Specifically, we are motivated by the recent
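As a concrete illustration of the intra-domain positive mining described above, the following is a minimal NumPy sketch: a toy temporal augmentation (a 2x speed change), a toy spatial augmentation (horizontal flip), and cosine-similarity k-NN retrieval from a target feature bank. The tensor shapes, augmentation choices, and function names are our own illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def temporal_augment(clip):
    """Temporal positive: subsample every second frame (a speed change),
    then tile back to the original length; action semantics are preserved."""
    slowed = clip[::2]                              # clip: (T, H, W, C)
    return np.repeat(slowed, 2, axis=0)[: clip.shape[0]]

def spatial_augment(clip, rng):
    """Spatial positive: random horizontal flip of all frames."""
    return clip[:, :, ::-1, :] if rng.random() < 0.5 else clip

def mine_target_nns(anchor_feat, target_bank, k=3):
    """Indices of the anchor's k nearest neighbors in the target feature
    bank, ranked by cosine similarity."""
    a = anchor_feat / np.linalg.norm(anchor_feat)
    bank = target_bank / np.linalg.norm(target_bank, axis=1, keepdims=True)
    sims = bank @ a
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
clip = rng.random((16, 8, 8, 3))        # toy video clip (T, H, W, C)
pos_t = temporal_augment(clip)          # temporal intra-domain positive
pos_s = spatial_augment(clip, rng)      # spatial intra-domain positive
bank = rng.random((100, 32))            # toy target feature bank
nn_idx = mine_target_nns(rng.random(32), bank, k=3)
```

Both augmented clips keep the anchor's shape and label, and the retrieved NN indices point to further target-domain positives.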
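The class-conditional Gaussian estimation and sampling step for robust cross-domain positives can be sketched as follows. This is a minimal NumPy version with toy features; the small ridge term added to each covariance is our own assumption for numerical stability and is not taken from the paper.

```python
import numpy as np

def estimate_class_gaussians(feats, pseudo_labels, num_classes, eps=1e-3):
    """Estimate N(mu_c, Sigma_c) per pseudo class. Averaging over all
    features of a class weakens per-sample pseudo-label noise."""
    stats = {}
    dim = feats.shape[1]
    for c in range(num_classes):
        fc = feats[pseudo_labels == c]
        mu = fc.mean(axis=0)
        sigma = np.cov(fc, rowvar=False) + eps * np.eye(dim)  # ridge for stability
        stats[c] = (mu, sigma)
    return stats

def sample_cross_domain_positives(stats, source_label, n, rng):
    """Draw synthesized target features for the source anchor's class."""
    mu, sigma = stats[source_label]
    return rng.multivariate_normal(mu, sigma, size=n)

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))        # toy target features
pseudo = rng.integers(0, 4, size=200)     # noisy pseudo-labels, 4 classes
stats = estimate_class_gaussians(feats, pseudo, num_classes=4)
positives = sample_cross_domain_positives(stats, source_label=2, n=5, rng=rng)
```

Given a source anchor with label 2, the five sampled rows act as its cross-domain positives; individual mislabeled target samples shift the estimated mean and covariance only slightly, which is what makes the synthesized positives robust.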

Figure 1: On HMDB→UCF: (a) Effect of false negatives and intra-/cross-domain positives. We study two contrastive methods in video DA: STCDA (Song et al., 2021) with intra-domain positives and LCMC (Kim et al., 2021b) with cross-domain positives. We observe that limited variations in intra-domain positives, pseudo-label noise in cross-domain positives, and false negatives are the three issues that largely hinder performance. (b) Purity comparison among nearest neighbors (NNs) and pseudo-labels. We find that the NNs in the target domain are clean and fit as informative and robust intra-domain positives. Here, purity measures the cleanness of pseudo-supervision relative to the ground truth.
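One plausible way to compute the purity compared in Figure 1(b) is as agreement with the ground truth: for NNs, the average fraction of each sample's retrieved neighbors that share its ground-truth class; for pseudo-labels, the fraction that match the ground truth. The sketch below implements this reading with toy, well-clustered features; the exact purity definition used in the figure may differ.

```python
import numpy as np

def nn_purity(features, gt_labels, k=3):
    """Average fraction of each sample's k nearest neighbors (cosine
    similarity, self excluded) that share the sample's ground-truth class."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ f.T
    np.fill_diagonal(sims, -np.inf)              # exclude self-matches
    nn_idx = np.argsort(-sims, axis=1)[:, :k]
    matches = gt_labels[nn_idx] == gt_labels[:, None]
    return matches.mean()

def pseudo_label_purity(pseudo_labels, gt_labels):
    """Fraction of pseudo-labels that agree with the ground truth."""
    return (pseudo_labels == gt_labels).mean()

rng = np.random.default_rng(0)
gt = rng.integers(0, 3, size=60)
# toy features: one basis direction per class plus small noise,
# so same-class samples are close in cosine similarity
feats = np.eye(8)[gt] + 0.1 * rng.normal(size=(60, 8))
p_nn = nn_purity(feats, gt, k=3)      # high for well-separated clusters
p_pl = pseudo_label_purity(gt, gt)    # 1.0 when pseudo-labels are exact
```

Under this definition, clean target-domain NNs correspond to `nn_purity` staying close to 1, while noisy pseudo-labels drag `pseudo_label_purity` down.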

