

Abstract

Emerging edge intelligence applications require the server to continuously retrain and update deep neural networks deployed on remote edge nodes to leverage newly collected data samples. Unfortunately, it may be impossible in practice to continuously send fully updated weights to these edge nodes due to the highly constrained communication resource. In this paper, we propose the weight-wise deep partial updating paradigm, which smartly selects only a subset of weights to update at each server-to-edge communication round, while achieving a similar performance compared to full updating. Our method is established through analytically upper-bounding the loss difference between partial updating and full updating, and only updates the weights which make the largest contributions to the upper bound. Extensive experimental results demonstrate the efficacy of our partial updating methodology which achieves a high inference accuracy while updating a rather small number of weights.

1. INTRODUCTION

To deploy deep neural networks (DNNs) on resource-constrained edge devices, extensive research has been done to compress a well-trained model via pruning (Han et al., 2016; Renda et al., 2020) and quantization (Courbariaux et al., 2015; Rastegari et al., 2016). During on-device inference, compressed networks may achieve a good balance between model performance (e.g., prediction accuracy) and resource demand (e.g., memory, computation, energy). However, due to a lack of relevant training data or an unknown sensing environment, pre-trained DNN models may not yield satisfactory performance. Retraining the model on newly collected data (from edge devices or from other sources) is then needed to reach the desired performance. Relevant example application scenarios include robotic vision sensing in an unknown environment (e.g., Mars) (Meng et al., 2017), local translators on mobile phones (Bhandare et al., 2019), and acoustic sensor networks deployed in Alpine environments (Meyer et al., 2019). On-device retraining is usually infeasible due to the resource-constrained nature of edge devices. Instead, retraining typically occurs on a remote server with sufficient resources. One possible strategy to continuously improve the model performance on edge devices is a two-stage iterative process: (i) at each round, edge devices collect new data samples and send them to the server, and (ii) the server retrains the network using all collected data, and then sends the updates to each edge device (Brown & Sreenan, 2006). An essential challenge herein is that the transmissions in the second stage are highly constrained by the limited communication resources (e.g., bandwidth, energy) in comparison to the first stage.
State-of-the-art DNN models often require tens or even hundreds of megabytes (MB) to store their parameters, whereas a single batch of data samples (a number of samples that can lead to reasonable updates in batch training) requires a relatively small amount of data. For example, on the CIFAR-10 dataset (Krizhevsky et al., 2009), the weights of a popular VGGNet require 56.09MB of storage, while one batch of 128 samples only uses around 0.40MB (Simonyan & Zisserman, 2015; Rastegari et al., 2016). As an alternative, the server could send a full update only once or rarely; but in this case, every node suffers from low performance until such an update occurs. Besides, edge devices could decide on and send only critical samples by using active learning schemes (Ash et al., 2020). The server may also receive training data from other sources, e.g., through data augmentation or new data collection campaigns. These considerations indicate that the updated weights sent by the server to the edge devices in the second stage become a major bottleneck. To resolve the above challenges pertaining to updating the network, we propose to partially update the network by changing only a small subset of the weights at each round. Doing so can significantly reduce the server-to-device communication overhead. Furthermore, fewer parameter updates also lead to fewer memory accesses on edge devices, which in turn results in lower energy consumption relative to (compressed) full updating (Horowitz, 2014). The goal of partial updating is to determine which subset of weights shall be updated at each round, such that an accuracy similar to fully updating all weights can be achieved. Our key concept for partial updating is based on the hypothesis that a weight shall be updated only if it has a large contribution to the loss reduction given the newly collected data samples.
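The asymmetry between the two transmission directions can be checked with back-of-envelope arithmetic. The sketch below reproduces the figures quoted above, under the assumptions (not stated in the text) that the VGGNet weights are stored in 32-bit floating point and the CIFAR-10 images as raw 8-bit RGB:

```python
# Back-of-envelope comparison: size of a full weight update vs. a batch
# of newly collected CIFAR-10 samples (figures quoted in the text).
BYTES_PER_MB = 1024 * 1024

# VGGNet weights: 56.09 MB at assumed 32-bit precision -> ~14.7M parameters.
model_bytes = 56.09 * BYTES_PER_MB
num_params = model_bytes / 4

# One batch of 128 CIFAR-10 images, assumed stored as 8-bit RGB
# (32 x 32 pixels x 3 channels = 3072 bytes per image).
batch_bytes = 128 * 32 * 32 * 3

print(f"full weight update: {model_bytes / BYTES_PER_MB:.2f} MB "
      f"(~{num_params / 1e6:.1f}M weights)")
print(f"one data batch:     {batch_bytes / BYTES_PER_MB:.2f} MB")
print(f"ratio:              {model_bytes / batch_bytes:.0f}x")
```

Under these assumptions, a single full server-to-edge update costs roughly 150 times as much communication as one edge-to-server batch of samples, which is what motivates updating only a small fraction of the weights.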
Specifically, we define a binary mask m to describe which weights are subject to update, i.e., m_i = 1 implies updating the i-th weight and m_i = 0 implies fixing it to its initial value. For any m, we establish an analytical upper bound on the difference between the loss value under partial updating and that under full updating. We determine an optimized mask m by combining two different viewpoints: (i) measuring the "global contribution" of each weight to the upper bound by computing the Euclidean distance, and (ii) measuring each weight's "local contribution" within each optimization step using gradient-related information. The weights to be updated according to m are further sparsely fine-tuned, while the remaining weights are rewound to their initial values. Related Work. Although partial updating has been adopted in some prior works, it is conducted in a fairly coarse-grained manner, e.g., layer-wise or neuron-wise, and targets completely different objectives. In particular, under continual learning settings, (Yoon et al., 2018; Jung et al., 2020) propose to freeze all weights related to the neurons which are more critical in performing prior tasks than new ones, in order to preserve existing knowledge. Under adversarial attack settings, (Shokri & Shmatikov, 2015) updates only the weights in the first several layers, which have a dominating impact on the extracted features, for better attack efficacy. Under architecture generalization settings, (Chatterji et al., 2020) studies generalization performance through the loss degradation incurred when rewinding the weights of each individual layer to their initial values. Unfortunately, such techniques cannot be applied in our problem setting, which seeks a fine-grained, i.e., weight-wise, partial updating given newly collected training samples in an iterative manner.
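The role of the binary mask can be illustrated with a minimal NumPy sketch. Here the weights to update are chosen greedily by the magnitude of their full-update change, used purely as an illustrative stand-in for the contribution measures described above; the function name and the top-k criterion are our own, not the paper's actual selection rule:

```python
import numpy as np

def partial_update(w_init, w_full, k):
    """Apply a weight-wise partial update: keep the k weight changes with
    the largest magnitude (illustrative proxy for a weight's contribution)
    and rewind all other weights to their initial values."""
    delta = w_full - w_init
    # Binary mask m: m_i = 1 for the k largest |delta_i|, m_i = 0 otherwise.
    m = np.zeros_like(delta)
    top_k = np.argsort(np.abs(delta))[-k:]
    m[top_k] = 1.0
    # Updated where m_i = 1, fixed to the initial value where m_i = 0.
    return w_init + m * delta, m

rng = np.random.default_rng(0)
w_init = rng.normal(size=10)
w_full = w_init + rng.normal(size=10)  # stand-in for fully retrained weights
w_partial, m = partial_update(w_init, w_full, k=3)
# Only k of the 10 weights differ from their deployed initial values,
# so only those k values (plus their indices) need to be transmitted.
print(int((w_partial != w_init).sum()))  # 3
```

In this view, the server only needs to send the k updated values and their indices at each round, rather than the full weight vector.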
The communication cost could also be reduced by other techniques, e.g., quantizing or encoding the updated weights and the transmission signal. Note, however, that these techniques are orthogonal to our approach and could be applied in addition. Also note that our partial updating setting differs from communication-efficient distributed (federated) training settings (Lin et al., 2018; Kairouz et al., 2019), which study how to compress multiple gradients calculated on different sets of non-i.i.d. local data, such that the aggregation of these (compressed) gradients results in a convergence performance similar to centralized training on all data. Traditional pruning methods (Han et al., 2016; Frankle & Carbin, 2019; Renda et al., 2020) aim at reducing the number of operations and the storage consumption by setting some weights to zero. Sending a pruned network (its nonzero weights) may also reduce the communication cost, but to a much lesser extent, as shown in the experimental results in Section 4.4. In addition, since our objective, namely reducing the server-to-edge communication cost when updating deployed networks, is fundamentally different from pruning, we can leverage previously learned knowledge by retaining prior weights (i.e., partial updating) instead of zeroing them out (i.e., pruning).
Contributions. Our contributions can be summarized as follows.
• We formalize the deep partial updating paradigm, i.e., how to iteratively perform weight-wise partial updating of deep neural networks on remote edge devices when newly collected training samples are available at the server. This substantially reduces the computation and communication demand on the edge devices.
• We propose a new approach that determines an optimized subset of weights to be selected for partial updating, by measuring each weight's contribution to an analytical upper bound on the loss reduction.
• Experimental results on three popular vision datasets show that, at a similar accuracy level, our approach reduces the size of the transmitted data by 91.7% on average (up to 99.3%), i.e., the model can be updated on average 12 times more frequently than with full updating.

2. NOTATION AND SETTING

In this section, we define the notation used throughout this paper, and provide a formalized problem setting, i.e., deep partial updating. We consider a set of remote edge devices that implement on-device

