A WEIGHT VARIATION-AWARE TRAINING METHOD FOR HARDWARE NEUROMORPHIC CHIPS

Anonymous

Abstract

Hardware neuromorphic chips that mimic biological nervous systems have recently attracted significant attention for their ultra-low power consumption and parallel computation. However, the inherent variability of nano-scale synaptic devices perturbs the weights and degrades the performance of neural networks. This paper proposes a training method that finds weights robust to intrinsic device variability. The stochastic weight behavior incurred by inherent device variability is taken into account during training. We investigate the impact of weight variation on both Spiking Neural Networks (SNNs) and standard Artificial Neural Networks (ANNs) with different architectures, including fully connected networks, convolutional neural networks (CNNs), VGG, and ResNet, on MNIST, CIFAR-10, and CIFAR-100. Experimental results show that our weight variation-aware training method (WVAT) can dramatically reduce the performance drop under weight variability by exploring a flat loss landscape. Under weight perturbations, WVAT yields 85.21% accuracy with VGG-5 on CIFAR-10, reducing accuracy degradation by more than a factor of ten compared with SGD. Finally, WVAT is easy to implement on various architectures with little computational overhead.

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved remarkable breakthroughs in computer vision, autonomous driving, and image/voice recognition (LeCun et al., 2015). Building on this success, neuromorphic technology, which mimics the human nervous system, has recently received significant attention in the semiconductor industry. Compared with the conventional von Neumann architecture, which has limitations in power consumption and real-time pattern recognition (Schuman et al., 2017; Indiveri et al., 2015), neuromorphic chips, biologically inspired by the human brain, are compact semiconductor chips that collocate processing and memory (Chicca et al., 2014; Catherine D. Schuman & Kay, 2022). Neuromorphic chips can therefore perform highly parallel operations and are suitable for real-time recognition of images, video, and audio with ultra-low power consumption (Indiveri & Liu, 2015). They are also well suited to "edge AI computing," which processes data on edge devices rather than in the cloud at a data center (Nwakanma et al., 2021): tasks that require a large amount of computation, such as training, are performed in the cloud, while inference runs on edge devices. Traditional cloud AI processing requires substantial computing power and network connectivity, which implies an enormous amount of data transmission, likely increasing latency and the risk of disconnections (Li et al., 2020). This causes severe problems in autonomous driving, robotics, and mobile VR/AR, all of which require real-time processing, so there is a growing need for data processing on edge devices. Neuromorphic devices are compact, mobile, and energy-efficient, making them promising candidates for edge computing systems. However, despite enormous advances in semiconductor integrated circuit (IC) technology, hardware neuromorphic implementation and embedded systems with numerous synaptic devices remain challenging (Prezioso et al., 2015; Esser et al., 2015; Catherine D.
Schuman & Kay, 2022). Design considerations such as multi-level states, device variability, programming energy, speed, and array-level connectivity are required (Eryilmaz et al., 2015). In particular, nano-electronic device variability is an inevitable issue originating from manufacturing and fabrication (Prezioso et al., 2010). Although there are many kinds of nano-electronic devices for neuromorphic systems and in-memory computing (including memristors, flash memory, phase-change memory, and optoelectronic devices), we refer to them simply as "devices" in this paper for readability. Device variability causes the synaptic weight values mapped in hardware to differ undesirably from the software weights. This gap between hardware synapses and software weights makes it challenging to deploy neural networks in real-world applications. Many recent studies have reported that device variability can significantly reduce the accuracy of neuromorphic hardware and DNN accelerators (Catherine D. Schuman & Kay, 2022; Peng et al., 2020; Joshi et al., 2020b; Sun & Yu, 2019; Kim et al., 2019; 2018). Although various studies address this problem, they focus on the unique behaviors of specific devices (Hennen et al., 2022; vls; Fu et al., 2022). The diversity of devices used to implement neuromorphic hardware means that customized solutions are required for each device's variation characteristics, so the versatility of such device-level solutions is limited. There is therefore a growing need for a hardware-oriented training method that learns parameters robust to device variability. It is widely known that wide and flat loss landscapes lead to improved generalization (Keskar et al., 2017; Li et al., 2018), so it is natural to expect that wide and flat loss landscapes with respect to the weights will mitigate the accuracy drop caused by device variability.
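To make the gap between software weights and hardware synapses concrete, the sketch below simulates one commonly assumed variation model: each programmed conductance deviates from its target by a multiplicative lognormal factor. The noise model, the function name, and the value of sigma are illustrative assumptions, not the specific device characterization used in this paper.

```python
import numpy as np

def apply_device_variation(weights, sigma=0.1, rng=None):
    """Simulate programming a software weight matrix onto variable devices.

    Each weight is assumed to deviate multiplicatively by a lognormal
    factor (an illustrative model; real devices may behave differently).
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.lognormal(mean=0.0, sigma=sigma, size=weights.shape)
    return weights * noise

# Example: perturb a trained 4x3 weight matrix and measure the deviation.
rng = np.random.default_rng(0)
w_sw = rng.normal(size=(4, 3))                    # "software" weights
w_hw = apply_device_variation(w_sw, sigma=0.1, rng=rng)
rel_err = np.abs(w_hw - w_sw) / np.abs(w_sw)
print(rel_err.mean())                             # roughly on the order of sigma
```

Because the lognormal factor is always positive, this particular model perturbs weight magnitudes but preserves their signs; additive Gaussian models, also common in the literature, do not have that property.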
However, we experimentally confirm that related studies (Izmailov et al., 2018; Wu et al., 2020; Foret et al., 2021) cannot significantly reduce the accuracy drop caused by device variation (experiments are provided later in Section 2). This observation underscores the need for a hardware-oriented neural network training method. Motivated by this, we propose a weight variation-aware training method (WVAT) that alleviates performance drops induced by device variability at the algorithmic level rather than the device level. This method explores a wide and flat weight loss landscape through an ensemble technique and a hardware-simulated variation-aware update method, making the software weights more tolerant to the perturbations caused by hardware synaptic variability. WVAT can effectively minimize performance drops under weight variations with little additional computational overhead in the training phase. Our contributions include the following:

• For the first time to the best of our knowledge, we investigate and analyze the impact of variations in model parameters on performance in several architectures, including standard Artificial Neural Networks (ANNs) and Spiking Neural Networks (SNNs), which are suitable for hardware neuromorphic implementation due to their event-driven spike properties.

• By exploring a flatter weight loss landscape, we propose WVAT, which is tolerant to intrinsic device variability. We introduce an ensemble technique for better generalization and present an intuitive weight update method with hardware-simulated variation. This method is also effective for quantization and input noise, which are hardware implementation issues besides weight perturbation.

• We experimentally demonstrate that WVAT achieves performance nearly on par with standard stochastic gradient descent (SGD) training while remaining robust to variations in model parameters. WVAT is easy to implement with little computational cost.

• By presenting an algorithm-level, hardware-oriented training method, we expect that WVAT will benefit the research community working on hardware implementation, including neuromorphic chips.¹
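The paper's exact WVAT algorithm is not reproduced here, but the general flavor of a hardware-simulated, variation-aware update can be sketched as follows: at each step, gradients are averaged over a small ensemble of randomly perturbed copies of the weights (simulating device variation), while the update itself is applied to the clean weights, biasing training toward flat regions of the loss landscape. The Gaussian perturbation model, the hyperparameters, and the toy logistic-regression task are all assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linearly separable binary classification data.
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = (X @ true_w > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def variation_aware_step(w, X, y, lr=0.5, sigma=0.05, n_samples=4):
    """One illustrative variation-aware SGD step.

    The logistic-loss gradient is averaged over several randomly
    perturbed copies of the weights (a small 'ensemble' of simulated
    device variations); the clean weights receive the update.
    """
    grad = np.zeros_like(w)
    for _ in range(n_samples):
        w_pert = w * (1.0 + sigma * rng.normal(size=w.shape))
        p = sigmoid(X @ w_pert)
        grad += X.T @ (p - y) / len(y)
    w -= lr * grad / n_samples          # update the clean weights
    return w

w = np.zeros(5)
for _ in range(300):
    w = variation_aware_step(w, X, y)

acc = ((sigmoid(X @ w) > 0.5) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

Averaging gradients over perturbed weight copies is what connects this sketch to flatness: a weight vector whose loss is low under many sampled perturbations sits in a wide minimum, which is exactly the property that tolerates hardware synaptic variability at inference time.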

2. BACKGROUND

Many studies have been conducted to develop training methods robust to device variability (Liu et al., 2015; Long et al., 2019; Zhu et al., 2020; Joshi et al., 2020b; Joksas et al., 2022; Huang et al., 2022). Liu et al. (2015) proposed adding a penalty for variations in model parameters to the training loss, and Long et al. (2019); Zhu et al. (2020) generated a noise model to reflect device variability during the training phase. Although Long et al. (2019) achieved good performance, their method is limited to binary devices (1 bit per cell). Moreover, as mentioned in Section 1, customized solutions for a given device have limited applicability to the general case. Recently, there have been many studies investigating the effect of the loss landscape on generalization (Garipov et al., 2018; Izmailov et al., 2018; Wu et al., 2020; Foret et al., 2021; Liu et al., 2022). It is

¹ A source code will be available soon.

