CROSS-MODAL DOMAIN GENERALIZATION FOR QUERY-BASED VIDEO SEGMENTATION

Anonymous authors
Paper under double-blind review

Abstract

Domain generalization (DG) aims to improve a model's robustness against the performance degradation that occurs when transferring to unseen target domains, and has been successfully applied to various visual and natural language tasks. However, DG for multi-modal tasks remains largely unexplored. Compared with traditional single-modal DG, the biggest challenge in the multi-modal setting is that each modality has to cope with its own domain shift. Directly applying previous methods can make the generalization directions of the individual modalities inconsistent, which harms performance when the model is migrated to the target domains. In this paper, we therefore explore query-based video segmentation as a scenario for studying how to better advance model generalization in the multi-modal setting. Since the information from different modalities is often consistent, we propose a query-guided feature augmentation (QFA) module and an attention-map adaptive instance normalization (AM-AdaIN) module. Compared with traditional DG models, our method lets the visual and textual modalities guide each other during data augmentation and learns a domain-agnostic cross-modal relationship, making it better suited to multi-modal transfer tasks. Extensive experiments on three query-based video segmentation generalization tasks demonstrate the effectiveness of our method.

1. INTRODUCTION

Query-based video segmentation was first introduced by Gavrilyuk et al. (2018); it aims to segment the queried actors or objects in a video based on a given natural language query. Although recent years have witnessed promising achievements in this field, a segmentation model trained on a source domain degrades dramatically on unseen target data due to the domain shift encountered in real applications. As shown in Figure 1, even though both sentences refer to a standing guy, the visual context complexity, the background environment, and the expression styles of the texts in the two cases are quite different, which makes direct transfer perform far below expectations. Although existing domain adaptation and generalization methods have achieved great success, few have been designed specifically for multi-modal domain generalization. In query-based video segmentation, domain shift exists not only at the image level (e.g., lighting, weather, background) and the instance level, but also at the natural language level (e.g., expression styles). A natural idea is to gradually augment the source domain data during training to simulate the domain shift incurred on new domains. However, directly applying existing methods to augment the two modalities separately and then fusing them may have negative effects, since it is difficult to ensure that the generalization directions of the two modalities are consistent. Thus, in this paper we combine the visual and textual modalities so that they facilitate each other during data augmentation. According to our observations, actions belonging to the same type of actor in different domains share similar representations in the query-video latent space; the main reason for the performance degradation is that visual background and contextual complexity vary widely across domains.
Although we cannot access the target data during training, the source data contains diverse background information that can be used to augment the source domain. We therefore propose a Query-guided Feature Augmentation (QFA) module, which uses the attention scores between the query and video frames to distinguish query-related foreground regions from unrelated background regions. We then keep the foreground regions to be segmented unchanged and synthesize novel visual features by gradually perturbing the background areas. To ensure the semantic consistency of the generated data, we introduce MoCo-based contrastive learning, forcing the model to retain query-related information in the background-perturbed video features. In addition, the expression styles of queries differ across domains, which leads to deviations in the attention map between visual and query features when migrating to the target domain. We therefore propose to apply AdaIN Huang & Belongie (2017) to the vision-to-query attention map (AM-AdaIN) to alleviate this issue. AdaIN helps the model remove the impact of style in the attention map during training and gradually introduces statistics from other samples, helping the model learn robust cross-modal relationships. To summarize, our main contributions are as follows:

• We are the first to conduct domain generalization on the query-based video segmentation task, which is also an early attempt to increase generalization ability on a cross-modal task.
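As a concrete illustration, the QFA idea and the contrastive consistency constraint can be sketched as follows. This is a minimal, hypothetical sketch, not the paper's implementation: the tensor shapes, the foreground threshold `fg_thresh`, the Gaussian background perturbation, and all function names are our assumptions, and a generic MoCo-style InfoNCE loss stands in for the paper's contrastive objective.

```python
import torch
import torch.nn.functional as F


def query_guided_feature_augmentation(visual_feat, attn_scores,
                                      noise_std=0.1, fg_thresh=0.5):
    """Hypothetical QFA sketch.

    visual_feat : (B, C, H, W) frame features
    attn_scores : (B, H, W) query-to-frame attention; higher = more query-related
    """
    # Normalize attention to [0, 1] per sample so one threshold applies to all.
    a = attn_scores.view(attn_scores.size(0), -1)
    a_min = a.min(dim=1, keepdim=True).values
    a_max = a.max(dim=1, keepdim=True).values
    a = (a - a_min) / (a_max - a_min + 1e-6)
    fg_mask = (a >= fg_thresh).float().view_as(attn_scores).unsqueeze(1)  # (B,1,H,W)

    # Perturb only the background; query-related foreground stays intact.
    noise = torch.randn_like(visual_feat) * noise_std
    augmented = visual_feat * fg_mask + (visual_feat + noise) * (1.0 - fg_mask)
    return augmented, fg_mask


def consistency_loss(q, k, queue, tau=0.07):
    """MoCo-style InfoNCE between augmented (q) and original (k) embeddings.

    queue : (K, D) memory bank of negative embeddings.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)           # (B, 1) positive logits
    l_neg = q @ F.normalize(queue, dim=1).t()          # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positives are index 0
    return F.cross_entropy(logits, labels)
```

In this sketch, "gradually enhancing the background" would correspond to increasing `noise_std` over training, while the InfoNCE term pulls the perturbed feature's embedding back toward the clean one.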



Domain adaptation (DA) Kim et al. (2021); Hoffman et al. (2018); Kim et al. (2019) and domain generalization (DG) Dou et al. (2019); Volpi et al. (2018); Choi et al. (2021) have been proposed to solve these problems. Unlike DA, which requires access to target-domain data during training and is usually difficult to achieve, DG learns domain-invariant features and improves domain robustness without requiring target-domain data. Previous DG methods use kernel-based optimization to extract domain-agnostic features Muandet et al. (2013); Li et al. (2018c;b), or use meta-learning to simulate domain-shift situations Li et al. (2018a); Liu et al. (2020). However, most of these methods are only suitable for multiple source domains. Adversarial data augmentation methods have been proposed to solve the single-domain generalization problem Volpi et al. (2018); Zhao et al. (2020); they use an adversarial loss to generate fictitious samples that are as realistic as possible. However, it is difficult to generate effective, semantically meaningful samples that differ substantially from the source distribution in semantic segmentation tasks. Thus, other methods, such as instance selective whitening Choi et al. (2021) and memory-guided networks Kim et al. (2022), have recently been proposed to handle this problem and achieve good performance.

Figure 1: (a) Demonstration of our multi-modal domain generalization task. (b) Illustration of the necessity of introducing DG in this task. "Source only" means the model is trained and tested on the same domain, while "target only" means the model is trained on another domain and directly transferred to this domain. As shown, performance drops substantially on both datasets.

Domain generalization (DG) requires the model to be robust to domain shift without accessing the target domain, meaning a model trained only on single or multiple source domains should also perform well on unseen target domains. Early models focus on extracting domain-invariant features Li et al. (2018c;b); Hu et al. (2020) or using kernel-based optimization to minimize the dissimilarity across domains Muandet et al. (2013). Dou et al. (2019) propose a domain-agnostic learning paradigm and encourage the model to learn semantically consistent features across training domains. Huang et al. (2021); Xu et al. (2021) use Fourier-based frameworks to enhance domain robustness. Other models augment the source domains at the data or feature level to improve robustness. Yue et al. (2019) use an auxiliary dataset to augment the source images and obtain images of different styles. Meta-learning is also an important line of work, which simulates domain-shift situations during training.

• We propose two novel modules, QFA and AM-AdaIN. Compared with previous DG methods, our model is better suited to multi-modal generalization tasks.

• Extensive experiments on three generalization tasks show that our model greatly enhances generalization ability, demonstrating the superiority of our method.
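To make the AM-AdaIN module named above concrete, the following sketch applies AdaIN-style statistic swapping to a vision-to-query attention map. The (B, HW, L) shape, the choice of the spatial axis for the instance statistics, and the batch permutation as the source of replacement statistics are our assumptions about one plausible realization, not the authors' exact implementation.

```python
import torch


def am_adain(attn_map, perm=None, eps=1e-5):
    """AdaIN on a vision-to-query attention map of shape (B, HW, L).

    Per-instance statistics are taken over the spatial axis (HW) and
    replaced with the statistics of another sample in the batch, so the
    model sees attention maps whose "style" varies while the normalized
    content is preserved.
    """
    mu = attn_map.mean(dim=1, keepdim=True)          # (B, 1, L)
    sigma = attn_map.std(dim=1, keepdim=True) + eps  # eps avoids divide-by-zero
    normalized = (attn_map - mu) / sigma             # style-removed map

    if perm is None:                                 # random partner by default
        perm = torch.randperm(attn_map.size(0))
    return normalized * sigma[perm] + mu[perm]       # re-stylize with partner stats
```

With `perm` set to the identity permutation the input map is reconstructed exactly, which makes a convenient sanity check; with a random permutation, each sample's attention statistics come from a different query-video pair, simulating the style deviation across domains.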

