CONTINUAL VISION-LANGUAGE REPRESENTATION LEARNING WITH OFF-DIAGONAL INFORMATION

Abstract

Multimodal pre-trained methods with a contrastive learning framework (like CLIP) have recently achieved consistent advantages on various cross-modal downstream tasks. However, they usually require a large amount of image-text samples and a vast computing budget for training, which makes re-training expensive when training data is collected continuously (a phenomenon that is widespread in real scenarios). In this paper, we discuss the feasibility of continually training CLIP models on discrete streaming data. We find that the multimodal retrieval performance of CLIP in a continual training setting is significantly lower than that in a joint training setting. We name this phenomenon Cognitive Disorder (CD). By tracking the directional changes of the representation vectors in the continually updated CLIP model, we explore and summarize the spatial variation of the modal encoders within CLIP: Intra-modal Rotation and Inter-modal Deviation. Intra-modal rotation means that the vision and language representation spaces in CLIP rotate greatly around the center of a high-dimensional unit sphere during continual training, accompanied by a relatively small change in the topology of the representation space. Inter-modal deviation happens when the intra-modal rotations of vision and language are unsynchronized. Moreover, we empirically and theoretically demonstrate how intra-modal rotation and inter-modal deviation lead to CD. To alleviate CD in continual CLIP training, we propose a new continual training framework Mod-X: Maintain off-diagonal information-matriX. By selectively aligning the off-diagonal information distribution of contrastive matrices, Mod-X helps the model not only better fit the newly trained data domain but also maintain its multimodal cognitive ability on the old data domain during continual large-scale training (Section 5).

1. INTRODUCTION

Recently, multimodal pre-trained models such as CLIP Radford et al. (2021) have attracted much attention. By utilizing these pre-trained models, many works have achieved new progress in downstream tasks such as classification Zhang et al. (2020); Wei et al. (2022); Lee et al. (2022), semantic segmentation Xie et al. (2021b); Wang et al. (2021b), object detection Xie et al. (2021a); Wang et al. (2022a), speech recognition Baevski et al. (2020), etc. Although the CLIP model generalizes strongly on open-world data, as mentioned in the CLIP paper Radford et al. (2021), its ability to match image-text samples outside its training data distribution is still weak. The natural idea to alleviate this problem is to scale the training data to cover more data domains. However, it is infeasible to train on ever-growing data all at once with limited hardware. In this paper, seeking to break this non-iterability, we explore the feasibility of continually training the CLIP model on streaming data, a training paradigm that follows Continual Learning (CL) McCloskey & Cohen (1989). To simulate continual CLIP training, we randomly and evenly divide the training data (joint-dataset) into multiple sub-datasets and train the CLIP sequentially on these sub-datasets. For comparison with continual training, we additionally train a CLIP from scratch on the joint-dataset, which we call joint training; it serves as the upper bound on the performance of the continually trained model. Traditional supervised continual learning has been proven to suffer from catastrophic forgetting Rebuffi et al. (2017); Kirkpatrick et al. (2017): the model's performance on old tasks drops significantly as training phases increase. Recently, some works Ni et al. (2021b); Hu et al. (2021) have validated that self-supervised models like SimCLR Chen et al. (2020) and BarlowTwins Zbontar et al. (2021) do not suffer from severe catastrophic forgetting during continual training. Some works Madaan et al.
(2021); Thai et al. (2021) conjecture that the reason is that the contrastive loss is not directly affected by the supervised signal, and that the self-supervised framework has no SoftMax function to amplify the influence of labels. However, the performance of CLIP in a continual training setting overturns this hypothesis: there is a significant degradation of multimodal retrieval results with continual training compared with joint training (Sections 3 and 5). We name this phenomenon Cognitive Disorder (CD). Since the vision and language encoders within CLIP normalize each representation to a unit vector through a dimension-wise L2 norm, which fixes the length of the representation vectors, we analyze the representation space variation of the modal extractors from a spatial-geometry perspective. By tracking the directional changes of the representation vectors in the continually updated CLIP model (Section 3), we explore and summarize the spatial variation of the modal encoders within CLIP: Intra-modal Rotation and Inter-modal Deviation. Intra-modal rotation refers to the representation space of a single-modal feature extractor (vision or language) within CLIP rotating around the center of the high-dimensional sphere, accompanied by a slow topology change, during continual CLIP training. Inter-modal deviation refers to the cognitive deviation of the different modal extractors (vision and language) on the same entities during continual training. Moreover, we empirically and theoretically demonstrate how intra-modal rotation and inter-modal deviation lead to cognitive disorder (Section 3). To alleviate this cognitive disorder in continual CLIP training, we propose a simple yet effective framework Mod-X: Maintain off-diagonal information-matriX. Unlike the contrastive loss Oord et al. (2018), Mod-X does not focus only on the proportion of positive and negative sample pairs.
The Mod-X framework pays more attention to the distribution of off-diagonal information in the contrastive matrix. The similarity distribution on the off-diagonal illustrates the model's cognition of all entities in the current data. By selectively aligning the off-diagonal information distribution of the contrastive matrices constructed by the current and past models on the current training samples, Mod-X helps the model preserve a correct cognition of various old entities while fitting the current vision-language data during continual training. The evaluations in Section 5 on datasets of different scales and scopes show that our Mod-X framework helps the model not only better fit the newly trained data domain (Section 5.3) but also maintain its multimodal cognitive ability on the old data domain during continual large-scale training (Section 5.4). More technical details and evaluations are given in Sections 4 and 5. In summary, our contributions are as follows:
• We discuss the feasibility of training the CLIP model continually on streaming data. Empirical experiments demonstrate that continual CLIP training leads to a persistent performance degradation on multimodal retrieval. We name this Cognitive Disorder.
• By introducing a series of tools to track the directional changes of the representation vectors in the continually updated CLIP model, we explore and summarize the spatial variation of the modal encoders within CLIP: 1) Intra-modal Rotation; 2) Inter-modal Deviation. Furthermore, we empirically and theoretically demonstrate how intra-modal rotation and inter-modal deviation lead to cognitive disorder (Section 3).
• We propose a simple yet effective continual CLIP training framework, Mod-X, that alleviates CLIP's cognitive disorder by selectively aligning off-diagonal information in the contrastive matrices of the past and current models during continual training.

2. RELATED WORK

Continual Learning. Continual learning (CL) Thrun (1995), or incremental learning, is mainly focused on supervised tasks.

Figure 1: The multimodal retrieval R@1 results of CLIP_t (0 ≤ t ≤ 5) on test sets COCO(5K) and Flickr30K(1K). The two sub-figures on the left show the Image-Text retrieval R@1 performance of CLIP_t at continual training phase t. The right ones show the Text-Image R@1 results. The pentagon points (CLIP_jt) show the results of CLIP under joint training, which is an upper bound for continual CLIP training (CLIP_ct).

3.1. THE PERFORMANCE OF CONTINUAL CLIP TRAINING

We show the R@1 retrieval results of CLIP_t (0 ≤ t ≤ 5) on the 5K test set of COCO (COCO(5K)) and the 1K test set of Flickr30K (Flickr30K(1K)) in Figure 1. Comparing the multimodal retrieval performance of CLIP_0 (initial phase) and CLIP_jt on Flickr30K(1K), we find that CLIP_jt clearly outperforms CLIP_0, which was not trained on Flickr30K. This shows that the performance of the CLIP model is affected by the training data domain, consistent with the results of Radford et al. (2021). Besides this, the multimodal retrieval performance of CLIP_ct on COCO(5K) clearly keeps declining as the training phases progress. The final Image-Text R@1 result of CLIP_5 on COCO(5K) plummets from the initial 14.7% to 6.1%, and the Text-Image result drops from 10.6% to 4.7%; the gaps with CLIP_jt reach 10.0% and 7.0%. On the other hand, CLIP_ct exhibits a slow and erratic increase in multimodal retrieval R@1 results on the test set Flickr30K(1K). Although the Image-Text R@1 gap between CLIP_ct and CLIP_jt narrows from the original 13.2% to 9.5% and the Text-Image R@1 of CLIP_ct increases from 12.0% to 16.1%, the gap between CLIP_5 and CLIP_jt remains large. We name this phenomenon Cognitive Disorder (CD).

3.2. THE REASONS FOR COGNITIVE DISORDER

In CLIP, the vision and language encoders normalize the final representation vector to unit length using a dimension-wise L2 norm. This design makes the representation spaces of the vision and language encoders form a high-dimensional unit sphere. Based on this fact, we ignore the length of the representation vectors and track only their directional changes.

3.2.1. THE INTRA-MODAL ROTATION

Firstly, we analyze the directional changes of the representation vectors of the vision and language extractors in continual CLIP training. Taking the visual representation space as an example, we use the visual encoder E_V^i in CLIP_i to extract the image representations of the test set COCO(5K) and obtain the vision representation vector set V_i = {V_i^0, V_i^1, V_i^2, ..., V_i^N, ..., V_i^5K}, where i = 0, ..., 5 stands for the different training phases. After that, we take the inner product of each pair of vectors <V_i^a, V_i^b> in each vector set V_i and apply the arccos operation to obtain the Self-Angle relationship Matrix (SAM_i), whose entries are SAM_i^(a,b) = arccos(<V_i^a, V_i^b>). Any element SAM_i^(a,b) of SAM_i represents the included angle between samples a and b under the vision encoder E_V^i. By counting the differences θ_SAM^(a,b) = |SAM_i^(a,b) − SAM_{i+1}^(a,b)| between corresponding elements of two consecutive matrices SAM_i and SAM_{i+1}, as shown in Figure 2(a), we obtain the included-angle change distribution in Table 2(b). From Table 2(b), we find that 80% of the angle changes between any two vision representation vectors are between 0 and 10 degrees during continual CLIP training, while only 20% are above 10 degrees. Moreover, less than 1% of the angle changes are above 20 degrees, and those between 15 and 20 degrees account for only about 5% of all image pairs. Therefore, we conclude that the topology of the visual representation space of CLIP_ct changes slowly during continual CLIP training. In Appendix 7.9, we discuss this conjecture by comparing the representation quality of the vision encoders.
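The SAM computation described above can be sketched as follows (a minimal NumPy sketch; the function names and the bin edges are our own choices, not taken from the paper's released code):

```python
import numpy as np

def self_angle_matrix(V):
    """Self-Angle Matrix: SAM[a, b] = arccos(<V_a, V_b>) in degrees,
    for L2-normalized representation vectors V of shape [N, D]."""
    cos = np.clip(V @ V.T, -1.0, 1.0)  # clip guards arccos against fp error
    return np.degrees(np.arccos(cos))

def sam_change_distribution(V_i, V_j, bins=(0, 5, 10, 15, 20, 180)):
    """Fraction of sample pairs whose theta_SAM = |SAM_i - SAM_j|
    falls into each angle interval (each unordered pair counted once)."""
    theta = np.abs(self_angle_matrix(V_i) - self_angle_matrix(V_j))
    iu = np.triu_indices_from(theta, k=1)
    counts, _ = np.histogram(theta[iu], bins=bins)
    return counts / counts.sum()
```

Feeding the normalized COCO(5K) image embeddings from two consecutive phases' vision encoders into `sam_change_distribution` would reproduce a table in the style of Table 2(b).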
In addition to the change in the included angle between sample pairs within the visual representation space, we take the inner product of the same sample's vision representation vectors from different training phases' vision encoders and apply the arccos operation to compute the rotation angle of each test sample between encoders E_V^i and E_V^j, obtaining the Rotation Angle Matrix RAM_(i,j) with entries RAM_(i,j)^a = arccos(<V_i^a, V_j^a>), where a is the sample index. The schematic diagram is shown in Figure 3(a). By counting the distribution of rotation angles, we obtain the rotation angle distribution in Table 3(b). Observing Table 3(b), we find that the direction of the same sample in the visual representation space changes greatly across training phases. Less than 0.4% of samples rotate within 20 degrees during continual CLIP training, at most 9% of samples rotate within 20-25 degrees, and samples rotating 25 degrees and above account for more than 90%. We speculate that the vision representation space of CLIP_ct undergoes a large rotation around the high-dimensional sphere center during continual training. After analyzing the language representation space, we reach the same conclusion as for the vision representation space; the detailed SAM and RAM distributions of the language encoders can be viewed in Appendix 7.2. According to our analysis of the geometric changes of the single-modal encoders' representation spaces during continual CLIP training, we conclude: during continual CLIP training, the representation space of CLIP_ct rotates significantly, while the topological structure of the representation space changes only slightly compared with the rotation of the whole space. We name this phenomenon Intra-modal Rotation.
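The per-sample rotation angle (RAM) between two phases' encoders can be computed analogously (again an illustrative sketch under our own naming, assuming unit-normalized representation matrices with one row per test sample):

```python
import numpy as np

def rotation_angles(V_i, V_j):
    """RAM_(i,j)[a] = arccos(<V_i^a, V_j^a>) in degrees: how far sample a's
    unit-norm representation rotated between phases i and j."""
    cos = np.clip(np.sum(V_i * V_j, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def rotation_distribution(V_i, V_j, bins=(0, 15, 20, 25, 30, 180)):
    """Fraction of samples falling in each rotation-angle interval."""
    counts, _ = np.histogram(rotation_angles(V_i, V_j), bins=bins)
    return counts / counts.sum()
```

Running this for the vision and language encoders separately, and comparing the two resulting distributions, is exactly the comparison drawn in Figure 4.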

3.2.2. THE INTER-MODAL DEVIATION

Although the topology of the single-modal representation spaces changes during continual training, this slight change should not be the main reason for the degradation of CLIP's multimodal retrieval performance. To this end, we conduct a thought experiment: it is known that the representation spaces of the vision and language encoders exhibit significant spatial rotations during continual training. Now assume that the topology of each single-modal representation space is completely fixed during continual training. Then, if CLIP_ct's multimodal retrieval performance did not degrade during continual training, the rotations of the two encoders' representation spaces would have to be synchronized. However, the opposite is true, so we hypothesize that there is a deviation between the rotations of the vision and language representation spaces. Based on this supposition, we compare the rotation distributions of the vision encoder (Figure 3(b)) and language encoder (Table 7(b)) and draw the rotation distribution comparison diagram (Figure 4). The values in the same color represent the proportion of test samples in each rotation-angle interval for the same modality. Comparing the rotation-angle distributions of the vision and language encoders, we can see that the space rotations of the two encoders differ greatly during continual training: the rotations of the language representation space are mostly concentrated between 20 and 30 degrees, while the vision rotations are mostly between 30 and 180 degrees. This shows that the rotations of the representation spaces of the two modal extractors within CLIP_ct are not synchronized during continual training, which verifies our previous inference. The unsynchronized rotation of the vision and language representation spaces leads to cognitive deviations between CLIP's modal encoders (vision and language). We name this phenomenon Inter-modal Deviation.

3.2.3. INTRA-MODAL ROTATION AND INTER-MODAL DEVIATION LEAD TO COGNITIVE DISORDER

Based on the above exploration, we conclude that intra-modal rotation and inter-modal deviation play a key role in CLIP_ct's cognitive disorder. But how do they cause the model to misalign the vision and language representations of old samples? We show a schematic to illustrate this. As shown in Figure 5, α is the vision representation, β is the language representation, and a, b denote different image-text samples. For convenience of illustration, we model the unsynchronized rotation of the two modal spaces as the visual modality staying static while the language modality rotates relatively. When intra-modal rotation happens (Figure 5(a)), β_a at training phase t+1 is rotated to β′_a, and the cross-modal similarity between a and b shifts from (β_a^T α_a > β_a^T α_b) to (β′_a^T α_a < β′_a^T α_b), which breaks the alignment of the current model on the old sample a. The superscript T denotes transposition, as commonly used in matrix multiplication. When inter-modal deviation happens (Figure 5(b)), the relative rotation of the representation spaces breaks the original modal alignment of sample a, making β_a^T α_b > β_a^T α_a. Because of this, the performance of CLIP_ct drops significantly during continual training. Detailed mathematical derivations can be found in Appendix 7.1.

Figure 5: Schematic illustration of cognitive disorder caused by intra-modal rotation and inter-modal deviation.
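The mechanism in Figure 5 can be reproduced numerically with a toy 2-D example (the vectors and rotation angles below are our own illustrative choices): a relative rotation of the language space flips the similarity ranking for sample a.

```python
import numpy as np

def rot(deg):
    """2-D rotation matrix for a counter-clockwise angle in degrees."""
    t = np.radians(deg)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

# unit vision representations for samples a and b on the toy 2-D sphere
alpha_a = np.array([1.0, 0.0])
alpha_b = np.array([0.0, 1.0])
# the language representation of a starts close to its own image
beta_a = rot(10) @ alpha_a

# phase t: beta_a^T alpha_a > beta_a^T alpha_b -> sample a retrieved correctly
assert beta_a @ alpha_a > beta_a @ alpha_b

# phase t+1: the language space rotates by 70 degrees relative to vision
beta_a_prime = rot(70) @ beta_a

# now beta'_a^T alpha_a < beta'_a^T alpha_b -> retrieval of sample a breaks
assert beta_a_prime @ alpha_a < beta_a_prime @ alpha_b
```

The topology of the toy language space is untouched (a pure rotation), yet the cross-modal ranking still flips, mirroring the schematic.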

4.1. GENERAL CONTINUAL CLIP TRAINING SETTING

Suppose we have used a training dataset D_0 to obtain a pre-trained model CLIP_0, and there is another vision-language dataset D. We split D randomly and evenly into N sub-datasets {D_1, ..., D_N} to simulate a stream of data, where D_t = {(v_t^0, l_t^0), ..., (v_t^n, l_t^n)} denotes the training data at training phase t, t ∈ {1, 2, ..., N}. We then train the model CLIP_0 on these sub-datasets sequentially. The encoded, L2-normalized embeddings of vision and text are V_t^i = E_V^t(v_t^i) and L_t^i = E_L^t(l_t^i).
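The random, even split into sub-datasets can be sketched as follows (a minimal sketch; `split_into_phases` is our own name, and the paper does not specify the shuffling details):

```python
import random

def split_into_phases(dataset, n_phases, seed=0):
    """Randomly and evenly split a list of (image, text) pairs into
    N sub-datasets D_1..D_N to simulate streaming training data."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    k, r = divmod(len(data), n_phases)
    phases, start = [], 0
    for i in range(n_phases):
        end = start + k + (1 if i < r else 0)  # spread the remainder
        phases.append(data[start:end])
        start = end
    return phases
```

Training then iterates over the returned list in order, continuing from the previous phase's checkpoint at each step.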

4.3. COGNITION ALIGN

The diagonal elements in CLIP's contrastive matrix represent the similarity between the model's understanding of the visual and language information of the same sample. The off-diagonal elements represent the similarity between the vision representation of the current sample and the language representations of other samples, or vice versa. The distribution of off-diagonal elements in the contrastive matrix therefore represents the current model's cognition of the current training objects. So we use Cognition Align to distill the old model's "world view" of the current samples, helping the current model maintain its cognitive ability on past entities. Firstly, we construct contrastive matrices M_{t-1} and M_t using the last and current models CLIP_{t-1} and CLIP_t on the current sub-dataset D_t:

M_{t-1}^{i,j} = CLIP_{t-1}(D_t) = s(E_V^{t-1}(v_t^i), E_L^{t-1}(l_t^j)),
M_t^{i,j} = CLIP_t(D_t) = s(E_V^t(v_t^i), E_L^t(l_t^j)),

where s(a, b) = a^T b is the cosine similarity function. However, the last model's cognition of the current data is not entirely correct. For those misunderstood samples (rows whose diagonal element is not the largest in the current retrieval), we replace the old model's similarity row with the corresponding row of the current model, thereby removing their influence on the current model during distillation:

M_{t-1}^{(i,:)} = M_t^{(i,:)}, if argmax(M_{t-1}^{(i,:)}) ≠ i.    (3)

After that, we align the current matrix with the screened information matrix M_{t-1} using the Kullback-Leibler divergence Csiszár (1975):

L_KL^t(M_t, M_{t-1}) = − Σ M_{t-1} ln(M_t / M_{t-1}).    (4)

The final training loss combines the InfoNCE loss and the alignment term, where α is a hyper-parameter:

L_Mod-X^t = L_InfoNCE^t + α L_KL^t.    (5)

5.3. PERFORMANCE OF MOD-X IN THE EXPERIMENT A

From the results in Table 1, it is clear that our method CLIP_Mod-X maintains its multimodal retrieval results on COCO(5K) after completing continual training on Flickr30K.
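The Cognition Align step (Eqs. 3-4) can be sketched in NumPy as follows: rows of the contrastive matrices are turned into distributions via softmax, misretrieved rows of the old matrix are screened out per Eq. 3, and the KL term of Eq. 4 is computed. This is an illustrative re-implementation under our own assumptions, not the authors' code; in practice M_t would be a differentiable tensor (e.g. in PyTorch) so the KL term can be backpropagated.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cognition_align_kl(M_t, M_prev):
    """Screen M_prev (Eq. 3), then KL-distill its off-diagonal 'world
    view' into the current contrastive matrix M_t (Eq. 4).
    M_t, M_prev: [n, n] similarity matrices on the current batch."""
    M_prev = M_prev.copy()
    # Screening: if the old model misretrieves row i (the diagonal is
    # not the row maximum), replace that row with the current model's.
    bad = M_prev.argmax(axis=1) != np.arange(M_prev.shape[0])
    M_prev[bad] = M_t[bad]
    P, Q = softmax(M_prev), softmax(M_t)  # rows as distributions
    return np.sum(P * np.log(P / Q))      # KL(M_prev || M_t)
```

The full objective then adds this term to the usual InfoNCE loss with weight α, as in Eq. 5.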
The gap between CLIP_0 and CLIP_Mod-X is just 0.2 points in image-text retrieval and 0.5 points in text-image retrieval on COCO(5K). At the same time, the retrieval results of CLIP_Mod-X on the test set Flickr30K(1K) are also affected by the training domain and increase significantly. The R@1 performance of CLIP_Mod-X in image-text retrieval rises from 16.9% (CLIP_0) to 27.9%, and the R@1 in text-image retrieval increases from 12.0% (CLIP_0) to 20.2%. The performance gap between CLIP_Mod-X and CLIP_jt on Flickr30K is at most 2.3 points. Conversely, due to the model's cognitive disorder in continual training, the performance of CLIP_ct on COCO(5K) drops significantly. In addition, although the performance of CLIP_ct on Flickr30K(1K) improves, it is still far from the upper bound CLIP_jt. EWC, a typical regularization strategy in continual learning, selectively updates the model by evaluating the importance of each parameter. From the above experimental results, although CLIP_EWC improves the accuracy of continual CLIP training on Flickr30K(1K), it does not preserve the model's understanding of past samples (COCO(5K)). From these comparisons, we conclude that our Mod-X framework can not only maintain cognitive ability on past samples during continual CLIP training but also improve the model's fitting of the current training data domain.

5.4. PERFORMANCE OF MOD-X IN THE EXPERIMENT B

Unlike Experiment A, which looks like fine-tuning, Experiment B focuses on verifying that our framework Mod-X can maintain or even improve the model's understanding of past entities when iterating under continual large-scale training.

7.1. THE THEORETICAL CONDITIONS FOR COGNITIVE DISORDER

Inter-modal deviation and intra-modal rotation can influence CLIP's sample similarity matrix, but this does not necessarily lead to errors in multimodal retrieval results, unless the similarity between the visual and language representations of the same sample becomes smaller than that between different samples. Here, we abstract this problem and give the theoretical conditions under which intra-modal rotation and inter-modal deviation lead to cognitive disorder. There are N image-text pairs {(α_1, β_1), (α_2, β_2), ..., (α_i, β_i), ..., (α_N, β_N)} ∈ R^{W×W}. Through functions M(α) and Q(β), M ≠ Q, the Euclidean spaces A and B of images and texts are formed:

A = span{M(α_1), M(α_2), ..., M(α_i), ..., M(α_N)},
B = span{Q(β_1), Q(β_2), ..., Q(β_i), ..., Q(β_N)},    (6)

where M(α_i), Q(β_j) ∈ R^D and ∥M(α_i)∥ = ∥Q(β_j)∥ = 1 for i, j = 1, 2, ..., N, and <M(α_i), Q(β_j)> is the cosine between M(α_i) and Q(β_j). Suppose there exist (α_a, β_a), (α_b, β_b) ∈ {(α_i, β_i), i = 1, 2, ..., N} with a ≠ b such that

<M(α_a), Q(β_a)> = max_i <M(α_a), Q(β_i)>,
<M(α_b), Q(β_b)> < max_{j≠b} <M(α_b), Q(β_j)>,    (7)

i.e., sample a is retrieved correctly while sample b is not.

7.1.1. INTER-MODAL DEVIATION LEADS TO COGNITIVE DISORDER

Prove: there exists a rotation matrix pair (A, B) that keeps the topologies of A and B unbiased and makes

<M′(α_a), Q′(β_a)> < max_{i≠a} <M′(α_a), Q′(β_i)>,
<M′(α_b), Q′(β_b)> = max_j <M′(α_b), Q′(β_j)>,    (8)

where M′ = A(M) and Q′ = B(Q), A ≠ B. The spaces A and B can then be written as A′ and B′:

A′ = A(A) = span{M′(α_1), M′(α_2), ..., M′(α_i), ..., M′(α_N)},
B′ = B(B) = span{Q′(β_1), Q′(β_2), ..., Q′(β_i), ..., Q′(β_N)}.    (9)

Solution: Eq. 7 can be written as

<M(α_a), Q(β_a)> − <M(α_a), Q(β_i)> > 0, ∀ β_i ∈ β, i ≠ a,
<M(α_b), Q(β_b)> − <M(α_b), Q(β_j)> < 0, ∃ β_j ∈ β, j ≠ b,    (10)

hence

M(α_a)^T Q(β_a) − M(α_a)^T Q(β_i) > 0, ∀ β_i ∈ β, i ≠ a,
M(α_b)^T Q(β_j) − M(α_b)^T Q(β_b) > 0, ∃ β_j ∈ β, j ≠ b.    (11)

Because the rotation matrix pair (A, B) can be seen as a single relative rotation matrix R(θ_D), where θ_D is the rotation angle between AB and A′B′, the target Eq. 8 can be written as

M(α_a)^T R(θ_D) Q(β_a) − M(α_a)^T R(θ_D) Q(β_i) < 0, ∃ β_i ∈ β, i ≠ a,
M(α_b)^T R(θ_D) Q(β_j) − M(α_b)^T R(θ_D) Q(β_b) < 0, ∀ β_j ∈ β, j ≠ b.    (12)

Because a rotation matrix is orthogonal (R(θ_D)^T R(θ_D) = I), Eq. 12 can be written as

M(α_a)^T R(θ_D) (Q(β_a) − Q(β_i)) < 0, ∃ β_i ∈ β, i ≠ a,    (13)
M(α_b)^T R(θ_D) (Q(β_j) − Q(β_b)) < 0, ∀ β_j ∈ β, j ≠ b.    (14)

For example, when R(θ_D) = −I, we have R(θ_D)^T M(α_a) = −M(α_a), so the signs of the inner-product differences in Eq. 11 flip and Eqs. 13 and 14 can hold. So rotation matrix pairs (A, B) that make Eq. 8 true exist.
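The closing example of the proof can be sanity-checked numerically: applying R(θ_D) = −I, which is orthogonal (and a proper rotation in even dimensions, det = 1), flips the sign of every inner-product difference. This small check is ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # even dimension, so -I has determinant +1 (a proper rotation)

def unit(x):
    return x / np.linalg.norm(x)

m_a = unit(rng.normal(size=D))   # M(alpha_a)
q_a = unit(rng.normal(size=D))   # Q(beta_a)
q_i = unit(rng.normal(size=D))   # Q(beta_i), i != a

R = -np.eye(D)
assert np.allclose(R.T @ R, np.eye(D))  # R is orthogonal

# whatever the sign of M(alpha_a)^T (Q(beta_a) - Q(beta_i)),
# applying R = -I flips it, as the proof's example requires
before = m_a @ (q_a - q_i)
after = m_a @ R @ (q_a - q_i)
assert np.isclose(after, -before)
```

This confirms the sign-flip step; the proof's existence claim then follows from choosing such an R as the relative rotation between the two modal spaces.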

7.1.2. INTRA-MODAL ROTATION LEADS TO COGNITIVE DISORDER

Since intra-modal rotation only requires that the length of the representation vectors after rotation remains 1, and does not require the intra-modal representation space to be invariant, it is a more general case than inter-modal deviation. This means that all rotation matrices satisfying the conditions in Section 7.1.1 also satisfy intra-modal rotation. Unlike inter-modal deviation, the mapping matrix P here is not required to be orthogonal. So we rewrite Eqs. 13 and 14 as:

M(α_a)^T P (Q(β_a) − Q(β_i)) < 0, ∃ β_i ∈ β, i ≠ a,
M(α_b)^T P (Q(β_j) − Q(β_b)) < 0, ∀ β_j ∈ β, j ≠ b.

Any mapping matrix P that rotates the direction of (Q(β_j) − Q(β_b)) by more than 90 degrees relative to M(α_b) satisfies these conditions.

7.2. DETAILED SAM AND RAM DISTRIBUTION OF LANGUAGE ENCODERS

The topological structure of the language representation space does not change significantly during continual CLIP training. But the whole language representation space, like the vision representation space, rotates greatly around the center of the high-dimensional sphere during continual training. The angle-change distribution and rotation-angle distribution are shown in Tables 7(a) and 7(b). From Table 7(a), we find that more than 88% of the angle changes between any two language representation vectors are between 0 and 10 degrees during continual CLIP training, while fewer than 12% are above 10 degrees. Moreover, less than 0.2% of the angle changes are above 20 degrees, and those between 15 and 20 degrees account for only about 1.5% of all pairs. Similar to the visual representation space, the direction of the same sample in the language representation space also changes greatly across training phases. However, unlike the vision representation space, where most rotations are above 30 degrees, the rotations in the language representation space are mostly distributed between 20 and 30 degrees. Because of this difference, the alignment of CLIP across the modalities of the same sample deviates during continual training.

7.3. THE RELATIONSHIP BETWEEN CONTRASTIVE MATRIX, INTRA-MODAL ROTATION, INTER-MODAL DEVIATION AND MOD-X

From a detailed point of view, the element M^{i,j} at position (i, j) of the contrastive matrix M is the similarity score between the i-th sample's vision embedding and the j-th sample's text embedding. Since the representation vectors have length 1, the similarity score M^{i,j} also determines the angle between those embeddings: greater similarity means a smaller angle. Therefore, the diagonal elements of the contrastive matrix M represent the angles between the different modalities of the same sample, and the off-diagonal elements represent the angles between the different modalities of different samples in CLIP's representation space. Through our exploration (Section 3), intra-modal rotation and inter-modal deviation affect these angles, i.e., the similarity scores. From an overall perspective, the similarity distribution of the contrastive matrix M reflects the structure of the model's representation space. Our Mod-X framework distills the similarity distribution of the off-diagonal elements, which is identical to distilling the structure of the model's representation space, and thus reduces the influence of intra-modal rotation and inter-modal deviation during continual CLIP training. To better illustrate the relationship between the model's representation space and its similarity performance, we add a more direct statistical analysis: the inter-modal angle variation distribution. Based on the settings in Section 3, at training phase t we measure the change in the inter-modal angle for the training samples that were retrieved correctly at phase t−1. A schematic diagram of the inter-modal angle variation θ_ImAV is shown in Figure 8(a), where sample a refers to a training sample that model CLIP_{t-1} retrieves correctly at phase t−1, V is the vision representation, and L is the language representation.
The inter-modal angle variation distribution can be seen in Figure 8(b). As shown in Figure 8(b), during continual training, the samples that were correctly retrieved in the past show apparent changes in the angle between modalities as the training phases go on. Less than 50% of the samples change within 5 degrees, about 30% change by 5-10 degrees, and more than 20% change their included angle by more than 10 degrees during training. This shows that the inter-modal spatial alignment (similarity performance) of CLIP_ct is affected by intra-modal rotation and inter-modal deviation. To illustrate that our Mod-X framework indeed alleviates the distribution shift between sample modalities in representation space during continual training, we show the inter-modal angle variation distribution of CLIP_Mod-X in Experiment A in Table 3. Comparing Figure 8(b) and Table 3, it can be found that CLIP_Mod-X well maintains the inter-modal spatial alignment of the correctly retrieved samples during continual CLIP training. From Figure 9, we can find that when CLIP is trained on a specific data domain, the rotation of the visual representation space becomes more severe: more than 70% of the samples rotate by more than 30 degrees in the visual space, a higher proportion than on the open-world dataset. Although rotations of more than 30 degrees in the language space also increase substantially compared with the open-world dataset, they remain significantly out of sync with the rotations in the visual space; most samples rotate within 30 degrees in language space.
Through this validation, we show that inter-modal deviation (rotational asynchrony) between the representation spaces of the different modal encoders persists during continual CLIP training on a specific data domain.
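The inter-modal angle variation statistic θ_ImAV can be computed as follows (our sketch, assuming unit-normalized vision/language representation matrices from phases t−1 and t, restricted to the samples retrieved correctly at phase t−1):

```python
import numpy as np

def pair_angles(V, L):
    """Per-sample angle (degrees) between matched unit-norm vision
    rows V[a] and language rows L[a]."""
    cos = np.clip(np.sum(V * L, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def inter_modal_angle_variation(V_prev, L_prev, V_cur, L_cur):
    """theta_ImAV[a] = |angle(V_a^{t-1}, L_a^{t-1}) - angle(V_a^t, L_a^t)|."""
    return np.abs(pair_angles(V_prev, L_prev) - pair_angles(V_cur, L_cur))
```

Binning the returned values (e.g. 0-5, 5-10, 10-15, 15-20, >20 degrees) reproduces tables in the style of Figure 8(b) and Table 3.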

7.4. DETAILED EXPERIMENT SETTING

In the exploration experiments (Section 3) and main experiments (Section 5), we use RN50. In Experiment B, because the training data reaches 1 million samples, we increase the batch size to 800 and the initial learning rate to 1e-3; the other hyper-parameters are consistent with Experiment A. The detailed hyper-parameters are listed in Table 3(b).

7.4.1. THE SENSITIVITY OF HYPER-PARAMETER α

In this section, we discuss the effect of different values of α on the final performance of CLIP_Mod-X, based on the settings of Experiment A (Section 5.3). Table 5 presents the final retrieval results of CLIP_Mod-X with α = 10, 15, 20, 25, 30.

[Table 5: Image-Text and Text-Image retrieval (R@1/R@5/R@10) on Flickr30K(1K) and COCO(5K) for CLIP_Mod-X under different α, with the COCO-pretrained baseline CLIP_0 for reference.]

From the table, we can find that although α affects the performance of CLIP_Mod-X, no choice of α undermines the effectiveness of the Mod-X framework: CLIP_Mod-X outperforms CLIP_ct under every α. As α increases, CLIP_Mod-X better maintains its retrieval ability on past COCO samples; the Image-Text R@1 and Text-Image R@1 on COCO(5K) remain around 14.5% and 10.0%. However, an excessively large α also limits the model's ability to fit new datasets: as α increases from 20 to 30, the Image-Text R@1 and Text-Image R@1 of CLIP_Mod-X on Flickr30K(1K) drop from 27.9% and 20.2% to 25.2% and 18.4%.
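The trade-off just described can be read off the objective: α weights how strongly the model is pulled toward the old model's similarity structure versus fitting the current data. A minimal sketch (function and argument names are ours; `loss_contrastive` and `loss_align` stand in for the InfoNCE term and the Mod-X cognition-align term):

```python
def mod_x_objective(loss_contrastive, loss_align, alpha):
    """Total training objective: fit the current domain (contrastive term)
    while penalizing drift from the old model's similarity structure
    (cognition-align term), weighted by alpha."""
    return loss_contrastive + alpha * loss_align

# a larger alpha emphasizes retention (alignment) over plasticity (fitting)
print(mod_x_objective(2.0, 0.05, 10))  # alpha = 10
print(mod_x_objective(2.0, 0.05, 30))  # alpha = 30: alignment dominates more
```

This is why very large α preserves COCO performance but slows adaptation to Flickr30K: the gradient of the alignment term increasingly outweighs the contrastive gradient.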

7.5. THE DETAILED PERFORMANCE OF THE MODELS AT EACH TRAINING PHASE IN EXPERIMENT A

In Figure 10, we show the effect of our framework Mod-X (CLIP_Mod-X) at each training phase and compare it with the other continual CLIP training strategies in Experiment A (Section 5.3). From the multimodal retrieval results at each training phase, we can find that Mod-X retains good performance on the past pre-training dataset COCO(5K) during continual training on the Flickr30K dataset. At each training phase, the R@1 results of CLIP_Mod-X on COCO(5K) show no significant drop, and the gap with the initial accuracy (CLIP_0) stays within ±1%. In addition, by comparing the retrieval performance of CLIP_ct and CLIP_Mod-X on the current training data domain (Flickr30K), it can be found that CLIP_Mod-X is also significantly better than CLIP_ct at continually fitting the current data domain. The low performance of the traditional regularization method EWC also shows that continual multimodal training is more complex than single-modal supervised training.
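The R@K numbers reported throughout can be reproduced from a similarity matrix between image and text embeddings. A minimal sketch, assuming one matching caption per image so the ground truth lies on the diagonal:

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim[i, j]: similarity of image i and text j; ground truth is j == i.
    Returns the fraction of queries whose match ranks in the top k."""
    order = np.argsort(-sim, axis=1)                       # best text first
    gt_rank = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    return {k: float(np.mean(gt_rank < k)) for k in ks}

sim = np.array([[0.9, 0.1, 0.2],    # match ranked 1st
                [0.3, 0.2, 0.8],    # match ranked 3rd
                [0.1, 0.7, 0.6]])   # match ranked 2nd
print(recall_at_k(sim, ks=(1, 2)))  # R@1 = 1/3, R@2 = 2/3
```

Text-Image retrieval uses the transposed matrix with the same routine.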

7.6. THE DETAILED PERFORMANCE OF THE MODELS AT EACH TRAINING PHASE IN EXPERIMENT B

In Figure 11, we show the effect of our framework Mod-X (CLIP_Mod-X) at each training phase in the large-scale continual training setting (Experiment B, Section 5.4). Comparing the R@1 results of the three continual training strategies at each training phase, we can clearly see that the performance of our Mod-X framework is stable across phases. As the training phases go on, the performance of Mod-X improves relative to the initial pre-training results: the Image-Text and Text-Image R@1 results on COCO(5K) rise by 1.7% and 0.5% points, respectively, and the gap in Image-Text R@1 on Flickr30K(1K) between CLIP_jt and CLIP_Mod-X narrows from 6.3% to 2.0%. The Text-Image R@1 on Flickr30K(1K) also improves. Table 6 reports the final multimodal retrieval performance (Image-Text and Text-Image R@1/R@5/R@10 on Flickr30K(1K) and COCO(5K)) of CLIP_ct, CLIP_Mod-X and CLIP_jt.


Comparing the final performance of the three training strategies, the Mod-X framework (CLIP_Mod-X) still outperforms CLIP_ct in large-scale pre-training. After continual pre-training, CLIP_Mod-X obtains a 40.40% Image-Text R@1 and a 27.74% Text-Image R@1 on the Flickr30K(1K) test set, surpassing the 35.50% and 24.54% of CLIP_ct. The results on COCO(5K) are similar: the Image-Text R@1 of CLIP_Mod-X on COCO(5K) is 4.68% points higher than CLIP_ct, and its Text-Image R@1 exceeds CLIP_ct by 2.12% points. The detailed R@1 performance of the three training strategies at each training phase can be seen in Figure 12.

[Table 7: Image-Text and Text-Image retrieval (R@1/R@5/R@10) on Flickr30K(1K) and COCO(5K) for the models based on OpenAI's CLIP_vit32.] The performance of our framework Mod-X is still better than CLIP_ct in all of these evaluation settings. Comparing the R@1 results on the Flickr30K(1K) test set, we can find that CLIP_Mod-X not only surpasses the initial results (CLIP_vit32) but is also 1.3% and 2.2% points higher than CLIP_ct. The results on COCO(5K) also illustrate that our framework not only resists the cognitive disorder of the model but also fits the new data domain better than CLIP_ct: the R@1 results of CLIP_Mod-X on COCO(5K) surpass CLIP_ct by 2.4% and 2.7% points, respectively.

[Table 8: Image-Text and Text-Image retrieval (R@1/R@5) on Flickr30K(1K), COCO(5K) and EC(5K).] Comparing the R@1 performance of CLIP_ft and CLIP_vit32 on the EC(5K) test set, we once again show that the training of the CLIP model is shaped by the training data domain: the R@1 results of CLIP_ft on EC(5K) are 8.8% and 9.9% points higher than CLIP_vit32. However, the retrieval results of CLIP_ft on COCO(5K) and Flickr30K(1K) drop by more than 10% points on average, which means that fine-tuning (one-phase continual training) the CLIP model costs it the ability to retrieve past samples.
This is also verified by the observation that the retrieval performance of CLIP_ct is lower than that of CLIP_ft. On the contrary, the CLIP_Mod-X obtained after continual training with the Mod-X framework only shows a slight drop of 3.3% points in the R@1 retrieval results on COCO(5K) and Flickr30K(1K). What's more, the performance of CLIP_Mod-X on EC(5K) outperforms CLIP_ct by 3.5% and 4.2% points on Image-Text R@1 and Text-Image R@1, respectively. All of this shows that the Mod-X framework not only preserves the inter-modal spatial structure of old samples during continual training but also improves the fitting ability of CLIP on the current training data domain. Figure 13 presents the R@1 retrieval performance of these three training strategies on COCO(5K), Flickr30K and EC(5K) at each training phase. The trends across the three test sets also illustrate that the Mod-X framework significantly alleviates cognitive disorder during continual CLIP training. Observing the changing trends of the linear-evaluation accuracy at each training phase, we can find that the representation quality of the vision encoder in CLIP_ct gradually decreases as the training phases increase: its top-1 accuracy on the ImageNet test set drops from 30.1% to 28.1%, consistent with our conjecture in Section 3.2.1. Compared to the decline in multimodal retrieval, this decrease in visual representation quality appears negligible. In addition, by comparing the results of CLIP_Mod-X and CLIP_jt, we can find that our Mod-X framework not only helps the model fit new image-text samples but also improves the representation quality of the modal encoders within CLIP: the top-1 accuracy of the vision encoder in CLIP_Mod-X improves from 30.1% to 32.0%.
All of this also illustrates that the representation quality of the encoders is not strictly positively correlated with the multimodal cognitive ability of the model. The trend in Experiment B is similar to the multimodal retrieval performance: although our Mod-X framework maintains the quality of the modal encoders, with the top-1 accuracy of the vision encoder increasing from 34.25% to 36%, a gap of about 4% to CLIP_jt remains. How to close this gap is a question worth further study.
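Linear evaluation as used above freezes the vision encoder and fits only a linear classifier on its features. The protocol presumably trains a logistic-regression probe on ImageNet features; the self-contained ridge-regression probe below on toy features is a cheap illustrative stand-in, not the exact procedure:

```python
import numpy as np

def linear_probe_top1(train_x, train_y, test_x, test_y, n_classes, reg=1e-3):
    """Fit a ridge-regression classifier to one-hot targets on frozen
    features and report top-1 accuracy of the argmax prediction."""
    onehot = np.eye(n_classes)[train_y]
    a = train_x.T @ train_x + reg * np.eye(train_x.shape[1])
    w = np.linalg.solve(a, train_x.T @ onehot)
    pred = np.argmax(test_x @ w, axis=1)
    return float(np.mean(pred == test_y))

rng = np.random.default_rng(0)
# toy "frozen features": two well-separated clusters, one per class
x0 = rng.normal(loc=-2.0, size=(50, 8))
x1 = rng.normal(loc=2.0, size=(50, 8))
x = np.vstack([x0, x1])
y = np.array([0] * 50 + [1] * 50)
print(linear_probe_top1(x, y, x, y, n_classes=2))  # separable data: perfect
```

Because only the linear head is trained, the resulting accuracy reflects the quality of the frozen representations rather than the classifier's capacity.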



Figure 2: The sub-figure on the left shows a schematic diagram of computing θ_SAM. The table on the right shows the distribution of the change of the included angle between any two samples in the vision representation spaces of different training phases, where SAM_{i-j} = |SAM_i − SAM_j|.

Figure 3: The sub-figure on the left shows a schematic diagram of computing θ RAM . The table on the right shows the rotation angle distribution of the same samples in different training phases.
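The rotation angle θ_RAM in the caption above is simply the angle between the same sample's embedding before and after a training phase. A sketch assuming unit-normalized embedding matrices (names are ours):

```python
import numpy as np

def rotation_angles(emb_old, emb_new):
    """theta_RAM: per-sample angle (degrees) between the old and new
    representation of the same sample, both unit-normalized."""
    cos = np.clip(np.sum(emb_old * emb_new, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

old = np.array([[1.0, 0.0],
                [0.0, 1.0]])
new = np.array([[1.0, 0.0],                                   # unchanged
                [np.sin(np.radians(30)), np.cos(np.radians(30))]])  # tilted
angles = rotation_angles(old, new)
print(angles)  # first sample: 0 degrees; second sample: 30 degrees
```

Histogramming these angles over a test set yields the rotation distributions compared in Figures 4 and 9.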

Figure 4: The comparison of the rotation distributions of the representation spaces of the vision and language extractors during continual CLIP training. CLIP i-j refers to the CLIP's continual training from training phase i to j.

Figure 6: The Mod-X framework mainly consists of two components. Cognition Align helps the current model align with the old model's cognition on the current data, and the Contrastive term helps the model fit the current training data domain. S_{i,j} denotes the cosine similarity score between the i-th sample's vision embedding and the j-th sample's text embedding.
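The Cognition Align component in the caption can be sketched as distilling the old model's row-wise similarity distribution into the current model, keeping the old model's off-diagonal structure only where it is trustworthy. The reliability filter below (keep rows where the old model still ranks the matched diagonal pair first) is our reading of "selectively aligning", not the verbatim released implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def cognition_align_loss(sim_old, sim_new):
    """KL(old || new) over each row of the image-text similarity matrix,
    applied only to rows where the old model ranks the matched (diagonal)
    pair first, i.e. where its cognition is presumed reliable."""
    p_old, p_new = softmax(sim_old), softmax(sim_new)
    reliable = np.argmax(sim_old, axis=1) == np.arange(len(sim_old))
    kl = np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=1)
    return float(np.mean(kl[reliable])) if reliable.any() else 0.0

old = np.array([[0.9, 0.1],
                [0.2, 0.8]])
print(cognition_align_loss(old, old))            # identical matrices: 0.0
print(cognition_align_loss(old, old[::-1]) > 0)  # drifted similarities: True
```

Matching whole rows, rather than only the diagonal, is what preserves the off-diagonal information that gives the framework its name.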

We explore the feasibility of training the CLIP model continuously on streaming data and name its performance decline in multimodal retrieval Cognitive Disorder (CD). Then, by tracking the directional changes of the representation vectors in the continuously updated CLIP model, we explore and summarize the spatial variation of the modal encoders within CLIP: Intra-modal Rotation and Inter-modal Deviation. Moreover, we mathematically demonstrate how intra-modal rotation and inter-modal deviation lead to CD. To alleviate cognitive disorder in continual CLIP training, we propose a simple yet effective continual learning framework, Mod-X: Maintain off-diagonal information-matriX. The results of our experiments (Sections 5.3 and 5.4) demonstrate the effectiveness of the framework.

7 APPENDIX

7.1. THEORETICAL DEMONSTRATION THAT INTER-MODAL DEVIATION AND INTRA-MODAL ROTATION LEAD TO COGNITIVE DISORDER

Figure 7: Detailed SAM and RAM Distribution of Language Encoders.

Figure 8: The sub-figure on the left shows a schematic diagram of computing θ_ImAV. The table on the right shows the distribution of the change of the included angle between the vision and language representations of samples that were correctly retrieved in the previous training phase.

Figure 9: The comparison of the rotation distributions of the representation spaces of the vision and language extractors during continual CLIP training on ECommerce-T2I. CLIP i-j refers to the CLIP's continual training from training phase i to j.

The final multimodal retrieval performance of different α in continual CLIP_Mod-X training in Experiment A.

Figure 10: The performance of different training strategies in each training phase in Experiment A.

Figure 11: The performance of different training strategies in each training phase in Experiment B.

Figure 13: The R@1 retrieval performance of different training strategies in each training phase on EC(5K), COCO(5K) and Flickr30K(1K).

Figure 14: The changes in the representation quality of visual encoders in Experiment A.

Figure 15: The changes in the representation quality of visual encoders in Experiment B.

The latest work (Thai et al., 2021; Ni et al., 2021b; Hu et al., 2021; Madaan et al., 2021) has drawn conclusions that differ from those in supervised settings. However, only a few works (Srinivasan et al., 2022; Fan et al., 2022) focus on incremental multimodal task learning. Because of the cooperation between different modal encoders, continual multimodal pre-training exhibits performance and problems different from single-modal continual training (Chowdhury et al., 2022). At the same time, large-scale image-text datasets, e.g., Laion-400M (Schuhmann et al., 2021) and Conceptual Captions (Sharma et al., 2018), have played a key role in multimodal pre-training. Although large-scale open-world datasets contain diverse samples, a pre-trained model still cannot perfectly match image-text sample pairs that are not in its training data domain (Radford et al., 2021).


We apply our Mod-X framework in two experimental settings to illustrate that Mod-X helps the model not only better fit the newly trained data domain but also maintain its multimodal cognitive ability on the old data domain during continual large-scale training. Experiment A: Experiment A follows the setup of the exploratory experiments, pre-training the CLIP_0 model on the COCO dataset, then splitting the Flickr30K dataset randomly and evenly into five sub-datasets and training on them sequentially to update the model. Table 1 shows the final multimodal retrieval performance of the different training strategies on COCO(5K) and Flickr30K(1K). CLIP_0 is the initial model pre-trained on the COCO dataset. CLIP_ct means training CLIP continually without any other operation. CLIP_EWC uses the EWC method (Kirkpatrick et al., 2017), a typical regularization strategy in continual supervised learning. CLIP_Mod-X is the proposed Mod-X, and CLIP_jt is the joint-training model using the joint dataset (COCO+F30K), an upper bound for continual CLIP training in Experiment A. The detailed performance of the different models at each training phase is shown in Appendix 7.5.
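The streaming setup above (splitting Flickr30K randomly and evenly into five sequentially trained sub-datasets) can be sketched as follows; `pairs` stands in for the list of image-text pairs and the seed is illustrative:

```python
import random

def make_stream(pairs, n_phases=5, seed=0):
    """Shuffle the dataset and split it evenly into sequential sub-datasets,
    simulating discrete streaming data for continual training."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    base, extra = divmod(len(pairs), n_phases)
    chunks, start = [], 0
    for i in range(n_phases):
        end = start + base + (1 if i < extra else 0)  # spread the remainder
        chunks.append(pairs[start:end])
        start = end
    return chunks

phases = make_stream(range(31783), n_phases=5)  # Flickr30K has ~31k images
print([len(c) for c in phases])  # five near-equal sub-datasets
```

Each chunk is then used for one continual-training phase, with no sample revisited in later phases.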

The final multimodal retrieval performance of the different continual CLIP training strategies in Experiment A.

Table 2 shows the final multimodal retrieval performance of the different continual CLIP training strategies in Experiment B. The detailed R@1 results in each training phase can be seen in Appendix 7.6.

The final multimodal retrieval performance of the different continual CLIP training strategies in Experiment B. Compared to the initial pre-trained model CLIP_0, our CLIP_Mod-X not only avoids the significant drop that CLIP_ct shows in multimodal retrieval performance on COCO(5K) and Flickr30K(1K) but even improves slightly after continual training on SBU-1M. The Image-Text R@1 on Flickr30K(1K) increases from 33.0% to 37.3%, and the accuracy on COCO improves from 17.7% to 19.5%. Text-Image R@1 behaves the same way: the accuracy increases from 23.4% to 26.0% on Flickr30K(1K) and from 12.9% to 13.4% on COCO(5K). The gap between continual CLIP training and joint training is thus somewhat narrowed. Conversely, the cognitive-alignment ability of CLIP_ct and CLIP_EWC is lost on the large-scale data. These results verify that our framework Mod-X can iterate under continual large-scale training and can maintain or even improve the model's cognitive ability on past entity concepts.


The table shows the distribution of the change of the included angle between the vision and language representations for CLIP_Mod-X in Experiment A. On average, 90% of the correctly retrieved samples have an angle change of less than 5 degrees during continual training, and samples with an angle change of more than 15 degrees account for less than 1% of all samples. All of this shows that the Mod-X framework mitigates cognitive disorder during continual CLIP training by preserving the inter-modal spatial alignment of samples that were retrieved correctly in the past.

We use RN50 (He et al., 2016) as the vision encoder; the language encoder is a transformer-based architecture following the modifications proposed in CLIP (OpenAI). The input images are resized to 224 × 224, and the input texts are tokenized by WordPiece with a maximum length of 77. We utilize the AdamW optimizer (Loshchilov & Hutter, 2017) and a cosine-annealing learning-rate schedule with warmup, consistent with OpenAI. All experiments are conducted on 8 NVIDIA V100 GPUs. In the exploration experiments and Experiment A, we use the hyper-parameters shown in Table 3(a). Since the size of the training data in the exploration experiments and Experiment A is relatively small compared to large-scale pre-training, we set a smaller batch size. All other hyper-parameters are consistent with CLIP (OpenAI).
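The cosine-annealing schedule with warmup mentioned above can be sketched as follows (a common formulation; the exact warmup length and floor used in our runs follow the hyper-parameter tables):

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

total, warm, lr0 = 1000, 100, 1e-3
print(lr_at_step(49, total, lr0, warm))    # mid-warmup: half of base lr
print(lr_at_step(warm, total, lr0, warm))  # warmup done: full base lr
print(lr_at_step(total, total, lr0, warm)) # end of training: decayed to 0
```

The same routine is reused in every phase of continual training, restarted on each sub-dataset.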

Table (a) lists the hyper-parameters used in the exploration experiments (Section 3) and Experiment A (Section 5.3). Table (b) lists the hyper-parameters used in Experiment B (Section 5.4).


The Text-Image R@1 on Flickr30K(1K) also improved to 25.98% from 23.44%. Conversely, the ability of cognition alignment in CLIP_ct and CLIP_EWC is lost in continual large-scale training. This verifies that our framework Mod-X can iterate under continual large-scale training and can maintain or even improve the model's cognitive ability on past entity concepts.

7.8. THE PERFORMANCE OF MOD-X WHEN FINE-TUNING OPENAI'S CLIP

In this section, we simulate continual CLIP training based on OpenAI's pre-trained model CLIP_vit32 with a ViT-B/32 vision encoder.

7.8.1. THE PERFORMANCE OF MOD-X WHEN FINE-TUNING OPENAI'S CLIP ON THE COCO AND FLICKR30K DATASETS

We set CLIP_vit32 as the initial model in the continual training process and divide the joint dataset (COCO and Flickr30K) into five sub-datasets uniformly at random to simulate streaming data. Because the pre-training datasets of CLIP_vit32 are not available, we train CLIP_vit32 on the joint dataset to get the model CLIP_ft as an upper bound for the performance of continual training. We apply our framework Mod-X in this setting and compare the final multimodal retrieval results with CLIP_ct, which is continual training without any other operations, in Table 7.

The final multimodal retrieval performance of CLIP_ct, CLIP_Mod-X and CLIP_ft based on OpenAI's CLIP_vit32 with a ViT-B/32 vision encoder.

7.8.2. THE PERFORMANCE OF MOD-X WHEN FINE-TUNING OPENAI'S CLIP ON THE ECOMMERCE-T2I DATASET

To illustrate that the Mod-X framework is not only applicable to open-world datasets, in this section we compare the performance of CLIP_Mod-X with CLIP_ct and CLIP_ft for continual training on a specific e-commerce data domain (ECommerce-T2I). CLIP_ct is continual training without any other operations, while CLIP_ft updates CLIP_vit32 using the whole ECommerce-T2I dataset at once. ECommerce-T2I (Yang et al., 2021) is a text-to-image e-commerce dataset that contains 90k training images and a 5k test image set (EC(5K)). Each image corresponds to one description, and the descriptions of the training and test sets do not overlap. We set CLIP_vit32 as the initial model in the continual training process and divide ECommerce-T2I into five sub-datasets uniformly at random to simulate streaming data. The final multimodal retrieval results are shown in Table 8.

The final multimodal retrieval performance of CLIP_ct, CLIP_Mod-X and CLIP_ft based on OpenAI's CLIP_vit32 on a specific e-commerce data domain (ECommerce-T2I).

7.9.2. THE CHANGES IN THE REPRESENTATION QUALITY OF VISUAL ENCODERS IN EXPERIMENT B

This section discusses the representation quality of CLIP's visual encoders in Experiment B. As shown in Figure 15, we report the vision encoders' linear-evaluation accuracy at different training phases.

