CONTINUAL VISION-LANGUAGE REPRESENTATION LEARNING WITH OFF-DIAGONAL INFORMATION

Abstract

Multimodal pre-training methods with a contrastive learning framework (like CLIP) have recently achieved consistent advantages on various cross-modal downstream tasks. However, they usually require a large amount of image-text samples and a vast computing budget for training, which makes re-training expensive as training data is collected continuously (a phenomenon that is widespread in real scenarios). In this paper, we discuss the feasibility of continually training CLIP models on discrete streaming data. We find that the multimodal retrieval performance of CLIP in a continual training setting is significantly lower than that in a joint training setting. We name this phenomenon Cognitive Disorder (CD). By tracking the directional changes of the representation vectors in the continually updated CLIP model, we explore and summarize the spatial variation of the modal encoders within CLIP: Intra-modal Rotation and Inter-modal Deviation. Intra-modal Rotation means that the vision and language representation spaces in CLIP rotate substantially around the center of a high-dimensional unit sphere during continual training, accompanied by a relatively small change in the topology of the representation space. Inter-modal Deviation happens when the intra-modal rotations of vision and language are unsynchronized. Moreover, we demonstrate empirically and theoretically how intra-modal rotation and inter-modal deviation lead to CD. To alleviate CD in continual CLIP training, we propose a new continual training framework, Mod-X: Maintain off-diagonal information-matriX. By selectively aligning the off-diagonal information distribution of the contrastive matrices, Mod-X helps the model not only better fit the newly trained data domain but also maintain its multimodal cognitive ability on old data domains during continual large-scale training (Section 5).
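The off-diagonal alignment idea at the heart of Mod-X can be illustrated with a minimal sketch. The function names, the row-wise softmax, and the KL-based alignment term below are our own illustrative assumptions, not the paper's exact formulation: the old model's contrastive matrix is used as a target whose off-diagonal (non-matched-pair) entries the new model is encouraged to preserve.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_matrix(img_emb, txt_emb, temperature=0.07):
    """Cosine-similarity logits between a batch of image and text embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T / temperature

def off_diagonal_alignment_loss(logits_new, logits_old):
    """KL-style penalty between old and new off-diagonal similarity
    distributions (a hypothetical stand-in for Mod-X's selective
    alignment of off-diagonal information)."""
    n = logits_new.shape[0]
    mask = ~np.eye(n, dtype=bool)  # drop the diagonal (matched pairs)
    p_old = softmax(logits_old, axis=1)
    p_new = softmax(logits_new, axis=1)
    kl = p_old * (np.log(p_old + 1e-8) - np.log(p_new + 1e-8))
    return kl[mask].sum() / n
```

In this sketch, the diagonal of the contrastive matrix carries the matched image-text pairs that the standard InfoNCE objective supervises, while the off-diagonal entries encode inter-sample structure; aligning only the latter lets the new task reshape the diagonal freely while retaining the old model's cross-sample relations.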

1. INTRODUCTION

Recently, multimodal pre-trained models such as CLIP Radford et al. (2021) have attracted much attention. By utilizing these pre-trained models, many works have achieved new progress in downstream tasks such as classification Zhang et al. (2020); Wei et al. (2022); Lee et al. (2022), semantic segmentation Xie et al. (2021b); Wang et al. (2021b), object detection Xie et al. (2021a); Wang et al. (2022a), speech recognition Baevski et al. (2020), etc. Although the CLIP model has strong generalization on open-world data, as mentioned in the CLIP paper Radford et al. (2021), its ability to match image-text samples outside its training data distribution is still weak. A natural way to alleviate this problem is to scale the training data to cover more data domains. However, it is infeasible to train on unbounded data with limited hardware at once. In this paper, aiming to break this non-iterability, we explore the feasibility of continually training the CLIP model on streaming data, a training paradigm that follows Continual Learning (CL) McCloskey & Cohen (1989). To simulate continual CLIP training, we randomly and evenly divide the training data (the joint dataset) into multiple sub-datasets and train the CLIP sequentially on these sub-datasets. For comparison with continual training, we additionally train a CLIP from scratch on the joint dataset, which we call joint training, as the upper bound on the performance of the continuously trained model. Traditional supervised continual learning has been shown to suffer from catastrophic forgetting Rebuffi et al. (2017); Kirkpatrick et al. (2017): the model's performance on old tasks drops significantly
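The continual training protocol described in this section can be sketched as follows. The helper names and the opaque `train_fn` callback are placeholders of our own; the point is only the data flow: one random, even split of the joint dataset, with the continual model trained shard by shard and the joint baseline trained on everything at once.

```python
import numpy as np

def make_continual_splits(dataset, num_tasks, seed=0):
    """Randomly and evenly split a joint dataset into sequential sub-datasets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(dataset))
    return [[dataset[i] for i in chunk] for chunk in np.array_split(idx, num_tasks)]

def continual_training(model, sub_datasets, train_fn):
    """Continual setting: update one model sequentially on each sub-dataset."""
    for sub in sub_datasets:
        model = train_fn(model, sub)
    return model

def joint_training(model, sub_datasets, train_fn):
    """Upper-bound baseline: train from scratch on the union of all sub-datasets."""
    joint = [sample for sub in sub_datasets for sample in sub]
    return train_fn(model, joint)
```

Comparing the two returned models on retrieval benchmarks is what exposes the Cognitive Disorder gap the paper studies.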

