DYNAMIC HISTORICAL ADAPTATION FOR CONTINUAL IMAGE-TEXT MODELING

Anonymous

Abstract

In realistic application scenarios, existing image-text modeling methods have limitations in dealing with data streams: training on all data requires excessive computation and storage resources, and full access to previous data may even be unavailable. In this work, we thus propose a new continual image-text modeling (CITM) setting that requires a model to be trained sequentially on a number of diverse image-text datasets. Although recent continual learning methods can be directly applied to the CITM setting, most of them only consider reusing part of the previous data or aligning the output distributions of the previous and new models, which is a partial or indirect way to acquire the old knowledge. In contrast, we propose a novel dynamic historical adaptation (DHA) method which can holistically and directly review the old knowledge from a historical model. Concretely, the historical model transfers all of its parameters to the main/current model so that the holistic old knowledge is utilized. In turn, the main model dynamically transfers its parameters back to the historical model every five training steps to ensure that the knowledge gap between the two models does not grow too large. Extensive experiments show that our DHA outperforms other representative/latest continual learning methods under the CITM setting.

1. INTRODUCTION

In the past few years, image-text modeling has drawn much attention from both academia and industry, playing a fundamental role in various cross-modal tasks such as image-text retrieval (Chen et al., 2020a; Lee et al., 2018), image captioning (Vinyals et al., 2015; Jia et al., 2015), and text-to-image generation (Johnson et al., 2018; Qiao et al., 2019). Although existing image-text modeling methods (Lu et al., 2019; Li et al., 2020; Lei et al., 2021; Yang et al., 2021; Ging et al., 2020; Bain et al., 2021; Huo et al., 2021; Jia et al., 2021) have achieved great success in these tasks, most of them assume that a full (fixed) set of image-text pairs is provided for model training, which limits their deployment in realistic application scenarios. In practice, training data often arrives as a stream, and the currently widely-used paradigm for image-text modeling faces two limitations: (1) training on all data (i.e., both previous and new data) severely increases the computational and storage overhead; (2) full access to previous data may be unavailable. To overcome these limitations, we thus propose a continual image-text modeling (CITM) setting instead. Concretely, we recollect four diverse image-text datasets respectively from MSCOCO (Lin et al., 2014), CC3M (Sharma et al., 2018), WIT (Srinivasan et al., 2021), and GoodNews (Biten et al., 2019), each of which is split into training, validation, and test sets. We adopt the SimCLR-based model (Chen et al., 2020b) as the basic model, which is also deployed in OpenAI CLIP (Radford et al., 2021). Under the CITM setting, the model is sequentially trained on each of the four image-text datasets and is finally evaluated on all of them. To demonstrate the well-known catastrophic forgetting problem, we measure the image-to-text retrieval performance with the metric recall@1 (R@1) during sequential training on the four datasets.
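The sequential train-then-evaluate protocol described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the interfaces `tasks`, `train_fn`, and `eval_r1` are hypothetical placeholders standing in for the four datasets, the SimCLR-based training procedure, and the R@1 evaluation.

```python
def run_citm(model, tasks, train_fn, eval_r1):
    """Sketch of the CITM protocol (all names are illustrative).

    `tasks` is an ordered list of datasets (e.g., MSCOCO, CC3M, WIT,
    GoodNews), each with 'train' and 'test' splits; `train_fn` trains the
    model on one training split; `eval_r1` returns image-to-text recall@1
    on one test split.
    """
    r1_history = []
    for i, task in enumerate(tasks):
        train_fn(model, task["train"])
        # After finishing each task, evaluate on the test sets of all tasks
        # seen so far; drops in the earlier columns of r1_history expose
        # catastrophic forgetting (cf. Figure 1).
        r1_history.append([eval_r1(model, t["test"]) for t in tasks[: i + 1]])
    return r1_history
```

The returned lower-triangular table mirrors how forgetting is usually reported: row i holds the R@1 scores on tasks 0..i after training on task i.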
The results in Figure 1 clearly show that every time the model is trained on a new dataset, its performance on previous datasets degrades distinctly (i.e., catastrophic forgetting). To alleviate this problem under CITM, we propose a novel dynamic historical adaptation (DHA) method which can holistically and directly review the old knowledge from a historical model. The core idea of our DHA is to directly transfer knowledge between the old and new models through parameter interaction. In our DHA, we call the model trained on the current task the main model, and the best (main) model on the last task the historical model. During parameter interaction, we directly transfer the parameters of the historical model to the main model and then train the main model with the modified parameters on the current task. Meanwhile, we dynamically update the historical model under the guidance of the main model to ensure that the knowledge gap between them does not grow too large. Specifically, every five training steps, the parameters of the main model are passed to the historical model for parameter modification. Together, these two parameter transfer strategies make up our DHA method. Compared with existing methods (Li & Hoiem, 2017; Chaudhry et al., 2019; Buzzega et al., 2020; Cha et al., 2021), our DHA has two advantages: (1) DHA adopts direct parameter transfer instead of indirect model aligning (deployed by regularization-based methods), and thus it is more robust to large domain shifts between the previous and new tasks. (2) DHA holistically reviews the old knowledge from the historical model, which overcomes the drawback of rehearsal-based methods that only a selected subset of the old data (i.e., partial old knowledge) is reused. To the best of our knowledge, we are the first to propose a direct parameter transfer method to cope with the forgetting problem in continual learning.
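The two parameter-transfer directions can be sketched as below. This is a hypothetical reading of the procedure under stated assumptions: the text does not specify how transferred parameters are combined with the receiving model's parameters, so the sketch uses assumed mixing coefficients `alpha` and `beta` (with `alpha=1.0` corresponding to a full historical-to-main copy); parameters are modeled as plain name-to-value dicts rather than network weights.

```python
def blend(dst, src, coef):
    """In-place per-parameter update: dst <- (1 - coef) * dst + coef * src."""
    for name in dst:
        dst[name] = (1 - coef) * dst[name] + coef * src[name]

def dha_train_task(main_params, hist_params, num_steps, train_step,
                   alpha=1.0, beta=0.5, sync_interval=5):
    """Sketch of DHA's two parameter-transfer strategies on one task.

    `train_step` performs one optimizer update of the main model on the
    current task; `alpha`/`beta` are assumed mixing coefficients not given
    in the text.
    """
    # Historical -> main: inject the holistic old knowledge into the main
    # model before training on the current task.
    blend(main_params, hist_params, alpha)
    for step in range(1, num_steps + 1):
        train_step(main_params)
        # Main -> historical: every five training steps, pull the historical
        # model toward the main model so the knowledge gap stays small.
        if step % sync_interval == 0:
            blend(hist_params, main_params, beta)
    return main_params, hist_params
```

After the final task, the best main model would become the historical model for the next task in the sequence, per the definition above.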
As mentioned above, we construct a benchmark dataset for the CITM setting by recollecting four diverse image-text datasets respectively from MSCOCO (Lin et al., 2014), CC3M (Sharma et al., 2018), WIT (Srinivasan et al., 2021), and GoodNews (Biten et al., 2019). Under a fair setting, we compare DHA with a number of baseline methods (Li & Hoiem, 2017; Chaudhry et al., 2019; Buzzega et al., 2020; Cha et al., 2021) on this benchmark dataset. Extensive experiments show that our DHA outperforms these baseline methods under the CITM setting. Overall, the main contributions of this paper can be summarized as follows: (1) We propose a new continual image-text modeling (CITM) setting for image-text modeling on data streams, which has a realistic application in large-scale multi-modal pre-training (with annual data updates), as shown in Figure 2. (2) We devise a novel dynamic historical adaptation (DHA) method under the CITM setting. For the first time, we identify the important role of direct parameter transfer (between the historical and main models) in continual learning. (3) We construct a benchmark dataset of four diverse sets of image-text pairs, which can facilitate research on CITM. (4) Extensive experiments demonstrate the effectiveness of our DHA under the CITM setting.



Figure 1: The results of catastrophic forgetting under the CITM setting.

Figure 2: Schematic illustration of the realistic application of our proposed CITM setting in large-scale multi-modal pre-training (like OpenAI CLIP) with the pre-training data being updated every year. Left: the traditional setting for large-scale pre-training with annual data updates. Right: our CITM setting for large-scale pre-training with annual data updates.

