CONTINUAL VISION-LANGUAGE REPRESENTATION LEARNING WITH OFF-DIAGONAL INFORMATION

Abstract

Multimodal pre-trained methods with a contrastive learning framework (like CLIP) have recently achieved consistent advantages on various cross-modal downstream tasks. However, they usually require a large amount of image-text samples and a vast computing budget for training, which makes re-training expensive as training data is collected continuously (a phenomenon that is widespread in real scenarios). In this paper, we discuss the feasibility of continuously training CLIP models on discrete streaming data. We find that the multimodal retrieval performance of CLIP in a continual training setting is significantly lower than that in a joint training setting. We name this phenomenon Cognitive Disorder (CD). By tracking the directional changes of the representation vectors in the continuously updated CLIP model, we explore and summarize the spatial variation of the modal encoders within CLIP: Intra-modal Rotation and Inter-modal Deviation. Intra-modal rotation means that the vision and language representation spaces in CLIP rotate greatly around the center of a high-dimensional unit sphere during continual training, accompanied by a relatively small change in the topology of the representation space. Inter-modal deviation happens when the intra-modal rotations of the vision and language modalities are unsynchronized. Moreover, we empirically and theoretically demonstrate how intra-modal rotation and inter-modal deviation lead to CD. To alleviate CD in continual CLIP training, we propose a new continual training framework, Mod-X: Maintain off-diagonal information-matriX. By selectively aligning the off-diagonal information distribution of contrastive matrices, Mod-X helps the model not only better fit the newly trained data domain but also maintain its multimodal cognitive ability on the old data domain during continual large-scale training (Section 5).

1. INTRODUCTION

Recently, multimodal pre-trained models such as CLIP Radford et al. (2021) have attracted much attention. By utilizing these pre-trained models, many works have achieved new progress in downstream tasks such as classification Zhang et al. (2020; 2021); Wei et al. (2022); Lee et al. (2022), semantic segmentation Xie et al. (2021b); Wang et al. (2021b), object detection Xie et al. (2021a); Wang et al. (2022a), speech recognition Baevski et al. (2020), etc. Although the CLIP model generalizes well to open-world data, as mentioned in the CLIP paper Radford et al. (2021), its ability to match image-text samples outside its training data distribution is still weak. The natural idea to alleviate this problem is to scale the training data to cover different data domains. However, it is infeasible to train on ever-growing data with limited hardware in one pass. In this paper, trying to break this non-iterability, we explore the feasibility of continuously training the CLIP model on streaming data, a training paradigm that follows Continual Learning (CL) McCloskey & Cohen (1989). To simulate continual CLIP training, we randomly and evenly divide the training data (the joint-dataset) into multiple sub-datasets and train the CLIP sequentially on these sub-datasets. For comparison with continual training, we additionally train a CLIP from scratch on the joint-dataset, which we name joint training, as the upper bound on the performance of the continuously trained model.

Traditional supervised continual learning has been proven to suffer from catastrophic forgetting Rebuffi et al. (2017); Kirkpatrick et al. (2017): the model's performance on old tasks drops significantly as the number of training phases grows. Recently, some works Ni et al. (2021b); Hu et al. (2021) have validated that self-supervised models like SimCLR Chen et al. (2020) and BarlowTwins Zbontar et al. (2021) do not suffer from severe catastrophic forgetting during continual training. Some works Madaan et al. (2021); Thai et al. (2021) conjecture that the reason is that the contrastive loss is not directly affected by the supervised signal, and that the self-supervised framework has no SoftMax function to amplify the influence of labels. However, the performance of CLIP in a continual training setting overturns this hypothesis: there is a significant degradation of multimodal retrieval results with continual training compared with joint training (Sections 3 and 5). We name this phenomenon Cognitive Disorder (CD). Because the vision and language encoders within CLIP normalize each representation to a unit vector through a dimension-wise L2 norm, which prevents the lengths of the representation vectors from drifting, we analyze the variation of the modal encoders' representation spaces from a spatial-geometry perspective.

By tracking the directional changes of the representation vectors in the continuously updated CLIP model (Section 3), we explore and summarize the spatial variation of the modal encoders within CLIP: Intra-modal Rotation and Inter-modal Deviation. Intra-modal rotation refers to the representation space of a single-modal feature extractor (vision or language) within CLIP rotating around the center of the high-dimensional unit sphere, accompanied by a slow topology change, during continual CLIP training. Inter-modal deviation refers to the cognitive deviation between the different modal extractors (vision and language) on the same entities during continual training. Moreover, we empirically and theoretically demonstrate how intra-modal rotation and inter-modal deviation lead to cognitive disorder (Section 3). To alleviate this cognitive disorder in continual CLIP training, we propose a simple yet effective framework, Mod-X: Maintain off-diagonal information-matriX. Unlike the contrastive loss Oord et al. (2018), which only focuses on the proportion between positive and negative sample pairs, the Mod-X framework pays attention to the distribution of off-diagonal information in the contrastive matrix. The similarity distribution on the off-diagonal reflects the model's cognition of all entities in the current data. By selectively aligning the off-diagonal information distributions of the contrastive matrices constructed by the current and past models on the current training samples, Mod-X helps the model preserve a correct cognition of old entities while fitting the current vision-language data during continual training.

The evaluations in Section 5 on datasets of different scales and scopes show that our Mod-X framework helps the model not only better fit the newly trained data domain (Section 5.3) but also maintain its multimodal cognitive ability on the old data domain during continual large-scale training (Section 5.4). More technical details and evaluations are given in Sections 4 and 5. In summary, our contributions are as follows:

• We discuss the feasibility of training the CLIP model continuously on streaming data. Empirical experiments demonstrate that continual CLIP training leads to a persistent performance degradation on multimodal retrieval, which we name Cognitive Disorder.

• By introducing a series of tools to track the directional changes of the representation vectors in the continuously updated CLIP model, we explore and summarize the spatial variation of the modal encoders within CLIP: 1) Intra-modal Rotation and 2) Inter-modal Deviation. Furthermore, we empirically and theoretically demonstrate how intra-modal rotation and inter-modal deviation lead to cognitive disorder (Section 3).

• We propose a simple yet effective continual CLIP training framework, Mod-X, that alleviates CLIP's cognitive disorder by selectively aligning the off-diagonal information in the contrastive matrices of the past and current models during continual training.

2. RELATED WORK

Continual Learning. Continual learning (CL) Thrun (1995), or incremental learning, has mainly focused on supervised tasks. In addition to vision-based tasks De Lange et al. (2021); Kj et al. (2021); Cha et al. (2021); Ahn et al. (2021), some works discuss language-based tasks Biesialska et al. (2020); Sun et al. (2019). Existing continual learning methods can be summarized into three categories: regularization Kirkpatrick et al. (2017); Ahn et al. (2019); Ni et al. (2021a), replay Rebuffi et al. (2017); Rolnick et al. (2019); Wang et al. (2021a), and architecture Thai et al. (2021); Ni et al. (2021b); Hu et al. (2021); Madaan et al. (2021). In unsupervised and self-supervised based on a


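The mechanism described in the abstract and introduction — a CLIP-style contrastive matrix of L2-normalized embeddings, with Mod-X additionally aligning the off-diagonal similarity distribution of the current model to that of the previous-phase model — can be illustrated with a minimal NumPy sketch. All function names here are our own illustration, and the row-wise KL divergence over off-diagonal entries is only one plausible reading of "selectively aligning the off-diagonal information distribution"; Section 4 of the paper specifies the actual method.

```python
# Illustrative sketch only: hypothetical helper names, not the paper's code.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project each row onto the unit hypersphere, as CLIP's encoders do."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_matrix(img_emb, txt_emb, temperature=0.07):
    """Pairwise cosine-similarity logits between image and text embeddings.
    The diagonal holds the matched (positive) pairs; everything off-diagonal
    reflects the model's cognition of non-matching entities."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    return img @ txt.T / temperature

def off_diagonal_alignment_loss(logits_cur, logits_old):
    """Row-wise KL divergence between the old and current models'
    off-diagonal similarity distributions. The diagonal is masked out,
    since those positive pairs are handled by the ordinary contrastive
    loss; only the off-diagonal 'cognition' is aligned."""
    n = logits_cur.shape[0]
    mask = ~np.eye(n, dtype=bool)

    def off_diag_softmax(logits):
        # Softmax over each row's n-1 off-diagonal entries.
        z = logits[mask].reshape(n, n - 1)
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(z)
        return p / p.sum(axis=1, keepdims=True)

    p_old = off_diag_softmax(logits_old)
    p_cur = off_diag_softmax(logits_cur)
    kl = np.sum(p_old * (np.log(p_old + 1e-12) - np.log(p_cur + 1e-12)), axis=1)
    return float(np.mean(kl))
```

In a continual-training step, this alignment term would be added to the usual contrastive loss, pulling the current model's off-diagonal distribution toward the previous phase's on the current batch; when the two models agree exactly, the term vanishes.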