DYNAMIC HISTORICAL ADAPTATION FOR CONTINUAL IMAGE-TEXT MODELING

Anonymous

Abstract

In realistic application scenarios, existing methods for image-text modeling have difficulty dealing with data streams: training on all data requires excessive computation and storage resources, and full access to previous data may even be unavailable. In this work, we thus propose a new continual image-text modeling (CITM) setting that requires a model to be trained sequentially on a number of diverse image-text datasets. Although recent continual learning methods can be directly applied to the CITM setting, most of them only consider reusing part of the previous data or aligning the output distributions of the previous and new models, which is a partial or indirect way to acquire the old knowledge. In contrast, we propose a novel dynamic historical adaptation (DHA) method which can holistically and directly review the old knowledge from a historical model. Concretely, the historical model transfers all of its parameters to the main/current model so that the holistic old knowledge is utilized. In turn, the main model dynamically transfers its parameters to the historical model every five training steps to ensure that the knowledge gap between them does not grow too large. Extensive experiments show that our DHA outperforms other representative/latest continual learning methods under the CITM setting.
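The two-way transfer schedule above can be sketched in a few lines. This is only an illustrative skeleton of the update scheme, not the paper's exact formulation: the function names, the `update_fn` stub standing in for a gradient step, and the simple weight interpolation used for the historical-to-main transfer (with coefficient `alpha`) are all assumptions; the abstract itself only states that parameters are transferred in both directions, with the main-to-historical transfer happening every five steps.

```python
import copy

TRANSFER_EVERY = 5  # main -> historical interval stated in the abstract


def blend(main_params, hist_params, alpha):
    """Historical -> main transfer, sketched here as a weight
    interpolation (the interpolation itself is an assumption)."""
    return {k: (1 - alpha) * main_params[k] + alpha * hist_params[k]
            for k in main_params}


def dha_schedule(main_params, num_steps, update_fn, alpha=0.5):
    """Run `num_steps` training updates while exchanging parameters
    on the schedule DHA describes."""
    hist_params = copy.deepcopy(main_params)  # holds the old knowledge
    for step in range(1, num_steps + 1):
        # Historical -> main: review the holistic old knowledge.
        main_params = blend(main_params, hist_params, alpha)
        # One training step on the current dataset (stubbed out).
        main_params = update_fn(main_params)
        # Main -> historical every five steps, keeping the gap small.
        if step % TRANSFER_EVERY == 0:
            hist_params = copy.deepcopy(main_params)
    return main_params, hist_params
```

The periodic main-to-historical copy is what makes the adaptation "dynamic": the historical model never lags the main model by more than five steps.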

1. INTRODUCTION

In the past few years, image-text modeling has drawn much attention from both academia and industry, playing a fundamental role in various cross-modal tasks such as image-text retrieval (Chen et al., 2020a; Lee et al., 2018), image captioning (Vinyals et al., 2015; Jia et al., 2015), and text-to-image generation (Johnson et al., 2018; Qiao et al., 2019). Although existing image-text modeling methods (Lu et al., 2019; Li et al., 2020; Lei et al., 2021; Yang et al., 2021; Ging et al., 2020; Bain et al., 2021; Huo et al., 2021; Jia et al., 2021) have achieved great success in these tasks, most of them assume that a full (fixed) set of image-text pairs is provided for model training, which limits their deployment in realistic application scenarios. That is, training data often arrives as a stream, and the current widely-used paradigm for image-text modeling faces two limitations: (1) training on all data (i.e., both previous and new data) severely increases the computational and storage overhead; (2) full access to previous data may be unavailable. To overcome these limitations, we thus propose a continual image-text modeling (CITM) setting instead. Concretely, we collect four diverse image-text datasets from MSCOCO (Lin et al., 2014), CC3M (Sharma et al., 2018), WIT (Srinivasan et al., 2021), and GoodNews (Biten et al., 2019), each of which is split into training, validation, and test sets. We adopt the SimCLR-based model (Chen et al., 2020b) as the basic model, which is also deployed in OpenAI CLIP (Radford et al., 2021). Under the CITM setting, the model is sequentially trained on each of the four image-text datasets and is finally evaluated on all of them. To demonstrate the well-known catastrophic forgetting problem, we measure the image-to-text retrieval performance with the metric recall@1 (R@1) during sequential training on the four datasets.
The results in Figure 1 clearly show that every time the model is trained on a new dataset, its performance on the previous datasets degrades distinctly (i.e., catastrophic forgetting).
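The CITM evaluation protocol described above can be sketched as follows. The helper names `train_fn` and `recall_at_1` are hypothetical stand-ins for the actual training routine and the image-to-text R@1 retrieval evaluation; the structure (train on each dataset in turn, then score every dataset) is what the setting prescribes.

```python
def continual_train_eval(model, datasets, train_fn, recall_at_1):
    """Sketch of the CITM protocol: train sequentially on each
    dataset, then measure R@1 on all datasets after every stage,
    which exposes catastrophic forgetting on earlier datasets."""
    history = []
    for train_set in datasets:
        model = train_fn(model, train_set)           # one training stage
        history.append([recall_at_1(model, d)        # evaluate on all
                        for d in datasets])          # datasets so far
    return history
```

Plotting each column of `history` over the stages yields curves like those in Figure 1: the R@1 on a dataset peaks right after training on it and drops as later datasets arrive.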



Figure 1: The results of catastrophic forgetting under the CITM setting.

