MECTA: MEMORY-ECONOMIC CONTINUAL TEST-TIME MODEL ADAPTATION

Abstract

Continual Test-time Adaptation (CTA) is a promising approach to securing accuracy gains in continually changing environments. State-of-the-art adaptations improve out-of-distribution model accuracy via computation-efficient online test-time gradient descent but meanwhile cost several times the memory of inference, even if only a small portion of parameters are updated. Such high memory consumption of CTA substantially impedes wide application of advanced CTA on memory-constrained devices. In this paper, we provide a novel solution, dubbed MECTA, to drastically improve the memory efficiency of gradient-based CTA. Our profiling shows that the major memory overhead comes from the intermediate cache for back-propagation, which scales with the batch size, channel count, and layer number. Therefore, we propose to reduce batch sizes, adopt an adaptive normalization layer to maintain stable and accurate predictions, and stop the back-propagation caching heuristically. On the other hand, we prune the networks to reduce the computation and memory overheads of optimization and recover the parameters afterward to avoid forgetting. The proposed MECTA is efficient and can be seamlessly plugged into state-of-the-art CTA algorithms at negligible overhead in computation and memory. On three datasets, CIFAR10, CIFAR100, and ImageNet, MECTA improves accuracy by at least 6% with constrained memory and significantly reduces the memory cost of ResNet50 on ImageNet by at least 70% with comparable accuracy. Our code can be accessed at https://github.com/SonyAI/MECTA.
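The cache-scaling claim can be made concrete with simple arithmetic: back-propagation through a batch-normalization layer caches roughly one value per activation element, so the per-layer cache grows with the product of batch size, channel count, and spatial resolution, and the total grows with the number of trainable layers. A minimal sketch with illustrative (not profiled) shapes:

```python
def bn_cache_bytes(batch, channels, height, width, bytes_per_float=4):
    # Back-propagation caches one float per activation element, so the
    # cache scales linearly in batch size, channel count, and spatial
    # size (and with the number of layers kept trainable).
    return batch * channels * height * width * bytes_per_float

# A ResNet50-like early stage on ImageNet (shapes are illustrative):
one_layer = bn_cache_bytes(batch=64, channels=64, height=56, width=56)
assert one_layer == 51380224        # ~49 MiB for a single layer
# Halving the batch size halves the cache:
assert bn_cache_bytes(32, 64, 56, 56) * 2 == one_layer
```

Summed over the dozens of BN layers in a deep network, this cache dwarfs the memory needed for inference, which motivates shrinking all three factors at once.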

1. INTRODUCTION

Many machine-learning applications require deploying deep neural networks, well-trained on a large dataset, to out-of-distribution (OOD) data and dynamically changing environments, for example, unseen data variations (Dong et al., 2022; Liu et al., 2023) or corruptions caused by weather changes (Hendrycks & Dietterich, 2019; Koh et al., 2021). Hence, recent efforts aim to tackle this emerging research challenge via continual test-time adaptation (CTA). The unsupervised, resource-constrained, and dynamic test-time environments in CTA make it a challenging learning problem and call for a self-supervised, efficient, and stable solution. Notable examples include Tent (Wang et al., 2021) and EATA (Niu et al., 2022). However, the memory consumed by their gradient-based optimization makes them impossible to adopt on edge devices, for example, the popular single-board computer Raspberry Pi with 1 GB of RAM, or older generations of smartphones (Ignatov et al., 2018). Observing the bottleneck at model.backward, a straightforward solution could be to reduce batch sizes and model scale (including the number of channels and layers), but several obstacles stand in the way of simultaneously maintaining model performance. First, a large batch size is essential for adaptation (Yang et al., 2022). Second, the amount of information extracted by a deep and wide model is desired for modeling distributionally-robust semantic features (Hendrycks et al., 2020a). In this paper, we tackle the aforementioned challenges by proposing a novel approach called Memory-Economic Continual Test-time Adaptation (MECTA). As illustrated in Fig. 1, our method is encapsulated in a simple normalization layer, MECTA Norm, which reduces all three dimensions of the intermediate cache: batch, channel, and layer sizes. (B) MECTA Norm accumulates distribution knowledge from streaming mini-batches and stays stable on small batches and under shifts between domains by using a shift-aware forget gate.
(C) Resembling sparse gradient descent, we introduce test-time pruning that stochastically removes channels of the cached intermediate results without knowing gradient magnitudes. (L) The forget gate also guides layer-wise adaptation: if a layer's distribution gap is sufficiently small, the layer is excluded from memory-intensive training. Our contributions are as follows.
• New Problem: We initiate the study of the memory efficiency of continual test-time adaptation, revealing a substantial obstacle in practice.
• New Method: We propose a novel method that improves the memory efficiency of different CTA methods. The simple norm-layer structure is ready to be plugged into various networks in place of batch normalization. The MECTA Norm layer also lets us stop and restart model adaptation on demand, so no caches sit unused when back-propagation is unwanted and none are absent when it is needed. Because our pruning is conducted on the cached data for back-propagation rather than on the forward pass, it greatly reduces memory consumption without the forgetting caused by removing parameters.
• Compelling Results: Our method maintains performance comparable to full back-propagation while significantly reducing the dynamic and maximal cache overheads. Given limited memory constraints, our method improves Tent and EATA by 8.5-73% in accuracy on the CIFAR10-C, CIFAR100-C, and ImageNet-C datasets.
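The shift-aware statistics update in (B) can be sketched in a few lines. The gate function and gap measure below are illustrative stand-ins for the paper's actual formulation, chosen only to show the mechanism: a large gap between running and batch statistics yields a large forget rate, while small noisy batches barely perturb the running estimates.

```python
import math

def forget_gate(run_mean, run_var, batch_mean, eps=1e-5):
    # Per-channel gap between running and current batch statistics,
    # squashed into beta in (0, 1).  A big gap (domain shift) pushes
    # beta toward 1, forgetting stale statistics quickly; a small gap
    # keeps beta near 0.  (Illustrative gap measure, not the paper's.)
    n = len(run_mean)
    gap = sum(abs(m - rm) / math.sqrt(rv + eps)
              for rm, rv, m in zip(run_mean, run_var, batch_mean)) / n
    return 1.0 - math.exp(-gap)

def update_stats(run_mean, run_var, batch_mean, batch_var):
    # Shift-aware moving average over streaming mini-batches.
    beta = forget_gate(run_mean, run_var, batch_mean)
    new_mean = [(1 - beta) * rm + beta * m
                for rm, m in zip(run_mean, batch_mean)]
    new_var = [(1 - beta) * rv + beta * v
               for rv, v in zip(run_var, batch_var)]
    return new_mean, new_var, beta

# In-distribution batch: statistics close to the running ones.
_, _, beta_same = update_stats([0.0, 0.0], [1.0, 1.0],
                               [0.05, -0.02], [1.02, 0.98])
# Shifted batch: a large gap triggers fast forgetting.
_, _, beta_shift = update_stats([0.0, 0.0], [1.0, 1.0],
                                [1.5, -2.0], [2.0, 3.0])
assert beta_shift > beta_same
```

Compared with a fixed momentum, this adaptive rate lets the layer trust accumulated statistics on small batches within a domain yet re-estimate quickly when the domain changes.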

2. RELATED WORKS

Test-time Adaptation (TTA) aims to improve model accuracy on out-of-distribution (OOD) data by adapting models with unlabeled test data. In comparison, traditional learning algorithms train models robustly, e.g., distributionally robust neural networks (Sagawa et al., 2020) or adversarial training (Deng et al., 2023), in order to generalize to OOD data. Early examples of TTA include test-time training (TTT) (Sun et al., 2020) and its variant (Liu et al., 2021), which jointly train a source model via both supervised and self-supervised objectives and then adapt the model via the self-supervised objective at test time. Adaptive risk minimization (Zhang et al., 2021) instead adapts using contextual information from unlabeled batches.



Early in 2017, Li et al. found that updating batch-normalization (BN) layers with all test-time data, without any training, greatly improved model OOD performance. Recently, Tent (Wang et al., 2021) significantly improved test-time performance by minimizing the prediction entropy in an efficient manner where only a few parameters are updated. More recently, EATA (Niu et al., 2022) improved sample efficiency and evaded catastrophic forgetting of the in-distribution data. While Tent and EATA have achieved impressive gains in OOD accuracy via online optimization, those optimizations are accompanied by large memory consumption and are prohibitive in many real-world CTA applications. Since many devices are designed only for on-device inference rather than training, memory-limited devices, like small sensors, cannot afford CTA algorithms. In Fig. 1, we demonstrate that Tent/EATA adaptation of ResNet50 (He et al., 2016) with a batch size of 64 (the default setting in Tent) costs more than 5 times the memory of standard inference in model.backward on ImageNet-C (Hendrycks & Dietterich, 2019). This large peak memory consumption makes EATA or Tent impractical on such devices.
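Tent's update can be illustrated in a few lines: minimize the softmax prediction entropy by gradient descent on a small set of affine parameters. Below, a hypothetical per-class scale `gamma` on fixed logits stands in for the BN affine parameters Tent actually tunes; the numbers are made up for illustration.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def entropy_grad(z):
    # dH/dz_k = -p_k * (log p_k + H) for p = softmax(z).
    p = softmax(z)
    H = entropy(p)
    return [-pk * (math.log(pk) + H) for pk in p]

# One Tent-style step: gradient descent on the entropy w.r.t. a small
# set of affine parameters (`gamma`, a hypothetical per-class scale on
# fixed logits z, standing in for BN affine parameters).
z = [1.0, 0.6, 0.4]
gamma = [1.0, 1.0, 1.0]
lr = 0.5
before = entropy(softmax([g * zi for g, zi in zip(gamma, z)]))
grad = entropy_grad([g * zi for g, zi in zip(gamma, z)])
# Chain rule: dH/dgamma_k = (dH/dz_k) * z_k.
gamma = [g - lr * gk * zi for g, gk, zi in zip(gamma, grad, z)]
after = entropy(softmax([g * zi for g, zi in zip(gamma, z)]))
assert after < before  # the prediction got sharper
```

The memory problem arises because even this tiny parameter set requires back-propagating through the network, which caches every intermediate activation along the way.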

Figure 1: Demonstration of the incremental memory footprint brought by each operation on ImageNet, and illustration of the proposed MECTA method, which reduces the cache size of gradient-based adaptation. During forwarding, MECTA Norm (B) stabilizes the normalization-statistic estimation via a shift-aware moving average over small batches, (C) randomly drops caches by channel, admitting sparse gradient descent without knowing the gradients in advance, and (L) maintains caches only for layers that demand training.
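The stochastic cache dropping in (C) amounts to drawing a random channel mask before the backward pass; only the masked-in channels are cached, and the dropped channels simply receive zero parameter gradients. A minimal sketch, where the keep probability and data layout are illustrative choices:

```python
import random

def prune_cache(activations, keep_prob, rng):
    # activations: per-channel cached tensors (here, channel -> list).
    # Channels are kept independently with probability keep_prob,
    # *without* looking at gradient magnitudes -- the decision is made
    # before back-propagation, unlike magnitude-based sparsification.
    return {c: a for c, a in activations.items()
            if rng.random() < keep_prob}

rng = random.Random(0)
cache = {c: [0.0] * 8 for c in range(64)}   # 64 channels cached
pruned = prune_cache(cache, keep_prob=0.25, rng=rng)
assert set(pruned) <= set(cache)            # a subset of channels
assert len(pruned) < len(cache)             # cache shrinks roughly 4x
```

Since the expected cache size is keep_prob times the original, the channel dimension of the intermediate cache can be traded off directly against gradient sparsity.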


