HARDWARE-RESTRICTION-AWARE TRAINING (HRAT) FOR MEMRISTOR NEURAL NETWORKS

Abstract

Memristor neural networks (MNNs), which utilize memristor crossbars for vector-matrix multiplication, offer substantial advantages in scalability and energy efficiency for neuromorphic computing. MNN weights are usually trained offline and then deployed as memristor conductances through a sequence of programming voltage pulses. Although weight uncertainties caused by process variation have been addressed in variation-aware training algorithms, efficient design and training of MNNs have not been systematically explored to date. In this work, we propose Hardware-Restriction-Aware Training (HRAT), which takes into account various non-negligible limitations and non-idealities of memristor devices, circuits, and systems. HRAT considers the MNN's realistic behavior and circuit restrictions during offline training, thereby bridging the gap between offline training and hardware deployment. HRAT uses a new batch normalization (BN) fusing strategy to align the distortion caused by hardware restrictions between offline training and hardware inference. This not only improves inference accuracy but also eliminates the need for dedicated circuitry for BN operations. Furthermore, input signals must be limited in amplitude to respect the non-destructive threshold voltage of memristors. To avoid input signal distortion of memristor crossbars, HRAT dynamically adjusts the input signal magnitude during training using a learned scale factor. These scale factors can be incorporated into the parameters of the linear operation together with the fused BN, so no additional signal scaling circuits are required. To evaluate the proposed HRAT methodology, FC-4 and LeNet-5 on MNIST are first trained by HRAT and then deployed in hardware. Hardware simulations match well with the offline HRAT results. We also carry out various experiments using VGG-16 on the CIFAR datasets.
This study shows that HRAT yields high-performance MNNs without device calibration or on-chip training, greatly facilitating commercial MNN deployment.

1. INTRODUCTION

Memristor neural network (MNN) has emerged as an increasingly feasible option to alleviate the scalability and energy efficiency challenges in neuromorphic computing. While several small-scale MNNs have been prototyped Li et al. (2018); Yao et al. (2020); Wan et al. (2022), efficient design and training of MNNs require an in-depth understanding of various restrictions from device, circuit, and system perspectives. These hardware restrictions include weight uncertainty noise caused by memristor variability and the limited number of programming pulse cycles available to tune memristor conductance (e.g., 500 in Yao et al. (2020)), weight quantization noise due to the limited number of memristor conductance states (e.g., 5-bit and 4-bit in Yao et al. (2020) and Wan et al. (2022)), the non-destructive threshold voltage of memristors Jo et al. (2010), the limited output swing of operational amplifiers Karki (2021), and bias quantization noise from finite-resolution digital-to-analog converters (DACs). These hardware restrictions collectively reduce the accuracy of MNN inference; ignoring them during software offline training may result in poor inference or even functional failure.

As a critical step in network training, batch normalization (BN) accelerates training convergence Ioffe & Szegedy (2015). The scale and shift operations of BN can be merged into the preceding linear operation (e.g., a fully connected or convolutional layer) after training. In this way, the hardware complexity and cost of MNNs are alleviated, as BN does not require explicit memristor crossbars in on-chip deployment. We envision that the aforementioned hardware restrictions have a significant impact on BN fusion (also known as BN folding) in MNNs. For example, bias distortion caused by DACs and the limited output swing of operational amplifiers should be considered during BN fusion. Although several BN fusing strategies Jacob et al. (2018); Krishnamoorthi (2018); PyTorch (2022); Wan et al. (2022) have been reported for quantization-aware training, dedicated BN fusing strategies for MNN training and hardware deployment have so far received little attention. It is therefore imperative to develop hardware-restriction-aware BN fusing strategies that align the signal distortion caused by hardware restrictions before and after BN fusion in MNNs.

In this work, we propose a Hardware-Restriction-Aware Training (HRAT) method, which takes into account various non-negligible restrictions and non-idealities from device, circuit, and system perspectives. HRAT considers the realistic behavior and hardware restrictions of MNNs during offline training, thereby bridging the gap between offline training and hardware deployment. The key contributions of this work are summarized as follows:

• We model various hardware restrictions of MNNs and integrate them into training, enabling hardware-restriction-aware training (HRAT). HRAT uses a new BN fusing strategy to align the restriction-induced distortion between offline training and hardware inference. This not only improves inference accuracy but also eliminates the need for dedicated circuitry for BN operations. To avoid input signal distortion of memristor crossbars, HRAT dynamically adjusts the signal magnitude during training using a learned scale factor. These scale factors can be incorporated into the parameters of the linear operation together with the fused BN, so no additional signal scaling circuits are required.

• We conduct various experiments on baseline networks and datasets (MNIST, CIFAR-10, and CIFAR-100) to demonstrate the performance of HRAT. To evaluate the proposed HRAT methodology, FC-4 and LeNet-5 are first trained by HRAT and then deployed in hardware. Hardware simulation results match well with the offline HRAT results, indicating that HRAT can bridge the gap between offline training and hardware deployment.
To investigate the effectiveness of HRAT on large-scale networks, we conduct experiments using VGG-16 on the CIFAR datasets. Experimental results demonstrate that HRAT leads to state-of-the-art MNNs without prohibitively expensive and time-consuming on-chip retraining, enabling low-cost, high-performance MNNs for large-scale commercialization of neuromorphic systems.

2. RELATED WORK

Existing BN fusing strategies for quantization-aware training convert the weight W of a linear operation into the fused weight Wγ/√(σ² + ϵ). Here γ is a learnable scale factor in BN, σ² is the output variance of the linear operation, and ϵ is a small constant added to prevent division by zero. Wan et al. (2022) implement BN fusion in a straightforward manner following Ioffe & Szegedy (2015). Jacob et al. (2018) introduce a fused linear operation at the current batch scale: during training, the statistics (i.e., mean and variance) of the current batch are extracted before performing the fused linear operation; these statistics are then used to update the Moving Average (MA) statistics and to generate the fused weight and bias, with which the fused linear operation is performed. Krishnamoorthi (2018) introduces a fused linear operation at the MA scale with a correction to the current batch scale: during training, the output of the fused linear operation is corrected from the MA scale to the current batch scale by multiplying by √(σ²_B + ϵ)/√(σ² + ϵ), where σ²_B is the output variance of the linear operation extracted from the current batch. Thus, an additional linear operation is required to compute the current batch statistics, which are used to update the MA statistics and to correct the fused scale of the linear operation. PyTorch (2022) introduces a fused linear operation at the MA scale with a plain BN and a correction to the unfused scale: in order to apply a plain BN, the scale of the fused linear operation is corrected to the unfused scale by multiplying by √(σ² + ϵ)/γ. Since the strategies of Jacob et al. (2018) and Krishnamoorthi (2018) require two linear operations (i.e., fused linear and linear), they are much more computationally expensive than that of PyTorch (2022), which involves only one linear operation.

While these existing BN fusing strategies help align the distortion caused by weight quantization, the hardware restrictions of MNN systems introduce further sources of distortion during BN fusion. For example, since the fused bias and the outputs of the fused linear operation are represented as voltages in MNNs, effective BN fusion for MNNs must ensure that they conform to the limited output swing of the operational amplifiers. Otherwise, voltage saturation and clamping of the fused bias or of the fused linear outputs will cause distortion and inference degradation. A BN fusing strategy should therefore align the distortion caused by hardware restrictions between training and hardware inference.
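As a concrete illustration of the single-linear-operation folding described above, the following PyTorch sketch folds BN at the MA scale, corrects the output back to the unfused scale, and then applies a plain BN. This is a minimal sketch under our own naming, not code from any framework; with an identity quantizer it reduces exactly to a linear layer followed by BN.

```python
import torch
import torch.nn.functional as F

def folded_bn_step(x, W, bn, quantize=lambda t: t, eps=1e-5):
    """One QAT step with BN folded at the moving-average (MA) scale
    (single linear operation, in the style of PyTorch's folding).

    The linear operation runs once with weights pre-scaled by
    gamma / sqrt(var_MA + eps), so the quantizer sees the fused weights
    that will actually be deployed. The output is corrected back to the
    unfused scale by multiplying by sqrt(var_MA + eps) / gamma, and a
    plain BN then normalizes it and keeps updating the batch statistics.
    """
    scale = bn.weight / torch.sqrt(bn.running_var + eps)
    y = F.linear(x, quantize(scale[:, None] * W))  # the only linear op
    y = y / scale                                  # back to unfused scale
    return bn(y)                                   # plain BN

# With an identity quantizer, folding is exactly linear followed by BN:
torch.manual_seed(0)
W = torch.randn(8, 16)
x = torch.randn(32, 16)
ref = torch.nn.BatchNorm1d(8)(F.linear(x, W))
out = folded_bn_step(x, W, torch.nn.BatchNorm1d(8))
```

Plugging a fake-quantizer into `quantize` turns this into quantization-aware training with one linear operation per step, as in the PyTorch (2022) strategy.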

3. MNN HARDWARE DEPLOYMENT AND RESTRICTIONS

An MNN consists of multiple interconnected network layers. The hardware schematic for one layer is plotted in Figure 1(a), where a memristor crossbar implements a layer of synapses, and each offline-trained weight is realized by the conductance difference between two differential memristors Prezioso et al. (2015); Li et al. (2018); Yao et al. (2020); Wan et al. (2022), allowing both positive and negative weights. Each offline-trained bias is downloaded into a register circuit, which controls a digital-to-analog converter (DAC) that provides a bias voltage to a neuron summation circuit. The neuron summation circuit and activation circuit work together to generate an output voltage Krestinskaya et al. (2019). Figure 1(b) shows the measured current-voltage sweep curve of memristors Yan et al. (2019). Memristors have two distinct modes of operation: a safe operation mode, in which the memristor conductance remains unchanged from its previously programmed value, and a conductance programming mode, in which a series of programming voltage pulses is applied to a memristor until its conductance approaches the expected value. Since MNNs should run in the safe operation mode during inference, the input voltage across the memristors cannot exceed the upper voltage limit of the safe operation region, such as 0.2V in Yan et al. (2019). This upper voltage limit can be viewed as the non-destructive threshold voltage Jo et al. (2010), V_TH, below which the previously programmed memristor conductance (i.e., the offline-trained weight) does not change during inference. The non-destructive threshold voltage varies with the material, fabrication process, and physical structure of memristors. To ensure proper inference, the input voltage clamp circuit in Figure 1(a) should restrict the voltages across the memristor crossbar within [-V_TH, V_TH].
Figure 1(c) shows the schematic of a neuron summation circuit, which consists of transimpedance amplifiers and a fully differential amplifier to convert and scale the current difference (i.e., I+ - I-) to a voltage and then add the DAC-generated bias voltage. To avoid the inconvenience of a dual power supply, the neuron summation circuit is designed to operate from a single positive power supply, with the signal ground level set at half of the supply voltage.

Limited output swing of operational amplifiers and DACs. The outputs of an amplifier or DAC can only swing within the power supply range (assuming rail-to-rail circuit topologies are used). The circuit outputs will be clamped at the ground or VDD level when the intended signal values exceed the output swing range, as shown in Figure 1(d). Such hardware restrictions should be considered during training.

Bias quantization noise of finite-resolution DACs. DAC circuits are widely used for bias voltage generation due to their high precision and great flexibility. The output resolution of a DAC (usually specified in bits) determines the smallest output increment that can be produced. When bias voltages are generated by finite-resolution DACs, bias quantization noise potentially degrades inference performance.
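The output-swing and DAC restrictions described above can be modeled with a few lines of code. The following sketch is illustrative only: the function names, the gain parameter `r_gain`, and the mid-supply signal ground convention are our assumptions, not circuit specifications from the paper.

```python
import numpy as np

VDD = 3.0  # default supply voltage used in this work

def dac_quantize(v, bits, vdd=VDD):
    """Quantize an intended bias voltage to a finite-resolution DAC
    output; values outside [0, VDD] are clamped first."""
    levels = 2 ** bits - 1
    code = np.round(np.clip(v, 0.0, vdd) / vdd * levels)
    return code / levels * vdd

def neuron_sum(i_pos, i_neg, r_gain, v_bias, vdd=VDD):
    """Neuron summation sketch: scale the differential memristor current
    around the mid-supply signal ground, add the DAC bias (also referred
    to mid-supply), and clamp to the amplifier output swing [0, VDD]."""
    v = vdd / 2 + r_gain * (i_pos - i_neg) + (v_bias - vdd / 2)
    return np.clip(v, 0.0, vdd)  # limited output swing
```

With an 8-bit DAC and a 3V supply, the smallest bias increment is about 11.8mV; any intended bias between two codes is rounded, and any summation result outside the supply rails is clamped.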

4. HARDWARE-RESTRICTION-AWARE TRAINING (HRAT)

In this section, we present Hardware-Restriction-Aware Training (HRAT), which takes into account various non-negligible restrictions of memristor devices, circuits, and systems. Figures 2(a) and 2(b) illustrate the HRAT process for a layer without BN and with the proposed BN fusing strategy, respectively. In HRAT, weight parameters are quantized to mimic the process of using a limited number of pulse cycles to program memristor conductance; bias parameters are quantized to mimic the bias quantization noise caused by the finite resolution of DACs; process variation of memristor devices is mimicked by adding weight uncertainty noise; and the limited output swing of operational amplifiers and DACs is mimicked by a clamp function. Furthermore, a trainable scale factor s is added to each layer, so that the output signal magnitude of each network layer can be adjusted to a proper range. Key aspects of HRAT are described in detail in the following subsections.
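The restrictions listed above can be composed into a single forward pass for one fully connected layer. The sketch below is a simplified illustration of this composition, not the authors' exact implementation: the clamp bounds, quantization steps, and noise model follow the descriptions in this section, and all names are ours.

```python
import torch
import torch.nn.functional as F

def hrat_layer(x, W, b, s, bw, dac_bits, noise_std, v_th=0.2, vdd=3.0):
    """HRAT forward pass for one fully connected layer (sketch).

    x is interpreted as crossbar input voltages; s is the per-layer scale
    factor, so V_c = s * VDD / 2 is the scaled signal range used for
    output clamping and bias quantization.
    """
    v_c = s * vdd / 2
    x = torch.clamp(x, -v_th, v_th)                   # non-destructive V_TH
    w_step = 2.0 / (2 ** bw - 1)                      # weights live in [-1, 1]
    Wq = w_step * torch.round(torch.clamp(W, -1, 1) / w_step)
    Wq = Wq + noise_std * 2.0 * torch.randn_like(Wq)  # std is a fraction of
                                                      # the weight range (= 2)
    b_step = 2 * v_c / (2 ** dac_bits - 1)            # DAC LSB over [-V_c, V_c]
    bq = b_step * torch.round(torch.clamp(b, -v_c, v_c) / b_step)
    y = torch.clamp(F.linear(x, Wq), -v_c, v_c) + bq  # amplifier output swing
    return F.relu(torch.clamp(y, -v_c, v_c))

torch.manual_seed(0)
x = torch.randn(4, 16) * 0.1
W = torch.randn(8, 16) * 0.5
y = hrat_layer(x, W, torch.zeros(8), s=0.4, bw=8, dac_bits=10, noise_std=0.0)
```

In HRAT proper, s is trained jointly with the weights (and the rounding steps use the STE trick from the Appendix); the sketch only shows how the individual restriction models chain together.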

5. BASELINE NETWORK ARCHITECTURES AND EXPERIMENTAL SETUP

We choose a four-layer fully connected NN (FC-4), LeNet-5 LeCun et al. (1998), and VGG-16 Simonyan & Zisserman (2015) as our baseline architectures. The miniature model FC-4 is designed for rapid verification of the proposed HRAT algorithm; it consists of three hidden layers (with 512, 128, and 32 nodes, respectively) and one classification layer. LeNet-5 and VGG-16 follow the original papers, with slight modifications to accommodate the different datasets. For experiments on the CIFAR datasets, the number of features in the hidden layers of LeNet-5 is increased by a factor of 5. For fast convergence and better performance, we also add batch normalization (BN) layers to each baseline model; features are normalized via BN except in the fully connected layers of VGG-16. The default supply voltage of the MNNs is 3V. Memristor behaviors (e.g., a non-destructive threshold voltage of 0.2V and a conductance tuning range of [2µS, 20µS]) are obtained from the experimental results of Yao et al. (2020). Offline weights are initialized and limited to the range [-1, 1]. After HRAT, the offline-trained weights are transformed to memristor conductance values for hardware deployment. Weight noise follows a Gaussian distribution and is generated relative to the weight range; for example, std=0.1 means that the standard deviation of the weight noise equals 10% of the entire weight range. We use 40 test runs to statistically measure the inference performance of these baseline architectures.

6. EXPERIMENTAL RESULTS AND DISCUSSION

6.1 FC-4 AND LENET-5 ON MNIST

Figure 3(a) plots the mean inference accuracy of FC-4 on MNIST for several combinations of weight noise and weight bitwidth. An inference accuracy of 98.60% is obtained with the software benchmark models (i.e., floating-point or 8-bit quantized weights without MNN hardware restrictions).
Mean accuracies of 96.05% and 97.82% are achieved for HRAT without and with signal magnitude scaling, respectively. The 1.77% accuracy difference reflects the importance of performing signal magnitude scaling in HRAT. If on-chip retraining is performed after HRAT, the mean accuracy rises from 97.82% to 98.40%, which is very close to the software benchmark result (i.e., 98.60%). Although on-chip retraining has been demonstrated in small-scale MNNs Li et al. (2018); Wang et al. (2019); Yao et al. (2020) to recover the accuracy drop caused by hardware non-idealities, such approaches require complex analog backpropagation learning circuitry Krestinskaya et al. (2018a;b), making them unsuitable for cost-effective hardware implementation. Figure 3(b) plots the standard deviation (std) of inference accuracy for FC-4 on MNIST. HRAT results in an average std of inference accuracy of 0.39, which drops to 0.07 after on-chip retraining. Figure 3(c) plots the mean inference accuracy of LeNet-5 on MNIST for several combinations of power supply voltage VDD and memristor threshold voltage V_TH, assuming zero weight noise and 8-bit quantized weights. Thanks to signal magnitude scaling, HRAT is insensitive to the choice of power supply voltage and memristor threshold voltage, and achieves a mean accuracy of 98.84%. Figure 3(d) shows very close inference accuracies of the FC-4 and LeNet-5 models on MNIST with HRAT. To validate the proposed HRAT, the offline-trained FC-4 and LeNet-5 models are implemented in hardware circuits and simulated using the Cadence Spectre tool. Fully connected layers are implemented according to Figure 1(a) (see Figure 6 in the Appendix for details). For convolutional layers, each filter kernel is implemented by a sub-circuit similar to a fully connected layer. The filter sub-circuit is instantiated multiple times, so all neuron convolutions of a layer are computed simultaneously.
ReLU activation and max pooling are also implemented by dedicated analog circuits. The BN-fused weights and biases, along with the scale parameters obtained from HRAT, are transformed to memristor conductance values, DAC outputs, and amplifier gains in these analog circuits. To reduce hardware simulation time, we use a macro model of the operational amplifiers to capture realistic behaviors, such as limited output swing, finite voltage gain, and limited bandwidth. The output voltages of each neuron obtained from hardware circuit simulations are then compared with their corresponding offline HRAT results. We find that the hardware simulations match the offline HRAT results (see Appendix A.4 for details).

6.2 VGG-16 ON CIFAR DATASETS

Figure 4(a) plots the mean inference accuracy of VGG-16 on CIFAR-100 for several combinations of weight noise and weight bitwidth. The software benchmark models (i.e., floating-point or 8-bit quantized weights) achieve inference accuracies of 68.59% and 66.27%, respectively; note that the software benchmark models are not affected by hardware non-idealities. At a weight noise level of std=0.025, HRAT with 6-bit quantized weights leads to the highest accuracy of 62.85%. At a weight noise level of std=0.05, HRAT with 4-bit quantized weights leads to the highest accuracy of 57.94%, which is only 8.33% lower than the software benchmark model (i.e., 66.27%). Since std=0.05 means that the standard deviation of the Gaussian weight noise equals 5% of the entire weight range, this is a significant noise level. These results demonstrate that HRAT is robust to weight noise disturbance. Figure 4(a) also reveals that, for a given weight noise level, there is an optimal weight bitwidth that balances the trade-off between noise immunity and the expressiveness of the MNN. A lower bitwidth means that higher quantization noise is injected into the weights during training, yielding stronger noise immunity during inference.
However, a lower bitwidth is not always better, because the expressive power of the model becomes limited Yoon et al. (2022). This explains why the optimal weight bitwidth in Figure 4(a) tends to be lower at stronger noise levels. Furthermore, Figure 4(a) shows that if signal magnitude scaling is not performed in HRAT, training cannot converge at all. Compared to Figure 3(a), where FC-4 trained by HRAT without signal magnitude scaling suffers only a 1.77% accuracy drop, this shows that signal magnitude scaling is even more critical and indispensable in large-scale MNNs. As shown in Figure 4(a), if VGG-16 is retrained on-chip after HRAT, the best accuracy rises to 65.93% and 65.88% for weight noise levels of std=0.025 and std=0.05, respectively; both are close to the software benchmark result (i.e., 66.27%). Figure 4(a) also shows that the optimal weight bitwidth for on-chip retraining is 8. This is because on-chip retraining is performed individually for each deployed MNN, rather than statistically as in HRAT. For each VGG-16 model under on-chip retraining, the weight uncertainty caused by memristor variations is therefore deterministic rather than stochastic, and when training a model with deterministic weight uncertainty, a higher weight bitwidth yields a better retraining result. Figure 4(b) plots the variance of inference accuracy for VGG-16 on CIFAR-100. Since a lower bitwidth is more robust to noise disturbance, HRAT with a lower bitwidth leads to less variance in inference accuracy. For HRAT with on-chip retraining, since retraining is performed individually with deterministic weight uncertainty, a higher bitwidth results in less variance in inference accuracy. The effect of DAC resolution on HRAT is simulated and depicted in Figures 4(c) and 4(d), where the performance of the software benchmark models is also plotted for comparison. Both figures demonstrate that finite-resolution DACs have a large impact on inference.
If an ideal DAC (i.e., with arbitrarily fine resolution) is used to generate the bias, an increase in weight noise reduces the inference accuracy. For a fixed weight noise level, reducing the DAC resolution significantly deteriorates inference. Figure 4(e) shows that very close results are obtained using 14-bit or ideal DACs for bias generation. To maintain high inference accuracy, the DAC resolution for VGG-16 on CIFAR-10 and CIFAR-100 needs to be no lower than 9 and 12 bits, respectively. To investigate the effect of learnable scale factors on HRAT, Figure 4(f) plots the inference accuracy curves of VGG-16 over 500 epochs on CIFAR-100. Compared with the simulation curves using fixed scale factors in HRAT, the learnable scale factors help VGG-16 converge with 2.42% and 4.71% higher accuracy for ideal DACs and 11-bit DACs, respectively.

7. CONCLUSION

A.3 HARDWARE DEPLOYMENT

For a layer without BN, the HRAT forward pass is

y = ReLU(C_Vc(C_Vc((Q(W; bw) + W_noise) C_VTH(x)) + Q(b)))

After training, the weight is transformed for deployment as

W_deploy = s_c · Q(W; bw) / s

and the bias is divided by s to fuse the scale factor:

b_deploy = Q(b) / s

Then, W_deploy is programmed to the memristor crossbars and b_deploy to the DACs. Since the weight is transformed to the memristor conductance range, an additional signal amplification factor 1/s_c must be taken into account. Therefore, the inference is expressed as

y = ReLU(C_VDD/2(C_VDD/2((1/s_c) · (W_deploy + W_noise) C_VTH(x)) + b_deploy))

With the proposed BN fusing strategy, the weight is fused as

W_fused = γ/√(σ² + ϵ) · W

In order to apply a normal BN after the fused linear operation, we correct the output to the original scale by dividing by γ/√(σ² + ϵ), and the bias term is then added at the original scale. Before applying BN, we have

y = (Q(W_fused; bw) + W_noise) C_VTH(x) + Q(b)

The result y_BN is obtained by applying a normal BN according to Eq. (4). The above steps only address the fused weight; the other hardware restrictions are not yet integrated. In order to quantize the fused bias and clamp the output range, we separate y_BN into the fused bias b_fused and ȳ_BN, where

b_fused = β + γ/√(σ² + ϵ) · (b − µ)

Finally, we can perform hardware deployment, similar to the steps in Section A.3.1.

A.4 CIRCUIT SIMULATION RESULTS

Figure 6 shows the hardware operation of a fully connected layer. The input signal is clamped to avoid excessive voltage across the memristors. The linear operation is executed via a memristor crossbar, where the weights are programmed as memristor conductance values. The output of the linear operation, along with the bias voltages from the DACs, is passed to a neuron summation circuit and an activation circuit for processing. Table 1 summarizes the differences between offline HRAT and hardware circuit simulation for all nodes across all layers of LeNet-5. The second column lists the number of outputs at each layer. We use the MNIST dataset with an input dimension of 28 × 28 × 1 for this experiment; hence, the output dimension of the first convolutional layer is 24 × 24 × 6. The last three columns summarize the maximum, mean, and standard deviation of the error values for each network layer. The results demonstrate that the circuit simulation closely matches the offline HRAT results.



(a) Hardware schematic of a one-layer MNN for the forward pass. A crossbar implements a layer of synapses; each synapse consists of two differential memristors to allow positive and negative weights. All amplifiers and digital-to-analog converters (DACs) operate from a single supply voltage and have a limited normal operation region as shown in (d), so the node voltages at V_M1, V_M2, V_M3, V_bias, and V_S are clamped to [0, VDD]. (b) Measured current-voltage sweep curve of a memristor from Yan et al. (2019), whose non-destructive voltage range is [-0.2V, 0.2V] (threshold voltage 0.2V). (c) A neuron summation circuit, which uses resistors and operational amplifiers to scale the current difference of the differential memristors and then adds a bias voltage from a DAC. (d) The output voltage swing of the amplifiers and DACs is limited between 0V and VDD (the supply voltage).

Figure 1: MNN hardware overview and deployment restrictions.

Figure 2: Hardware-restriction-aware training (HRAT) for MNNs.

Figure 3: Inference performance of MNN FC-4 and LeNet-5 models on MNIST.

Figure 4: Inference performance of MNN VGG-16 models on CIFAR datasets.

where C_Vc(·) clamps its input to the scaled signal range [-V_c, V_c] with V_c = s · VDD/2, W_noise is generated from a Gaussian distribution to model memristor variations, and s is a learnable scale factor. Q(b) quantizes b according to the bitwidth of the DAC and the scaled signal range [-V_c, V_c]. In practice, we use uniform noise instead of quantization noise to introduce randomness during training. After training, the weight is divided by s to obtain the scaled weight. Since a memristor has a limited tunable conductance range (e.g., [2µS, 20µS] in Yao et al. (2020)), the scaled weight is transformed to a memristor conductance by multiplying by a converting factor s_c as:
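The transformation of scaled weights into conductances via the converting factor s_c can be sketched as follows. The differential-pair mapping below (programming the positive part of each weight on G+ and the negative part on G-) is one plausible scheme consistent with the differential crossbar in Figure 1; the exact mapping used in hardware may differ.

```python
import numpy as np

G_MIN, G_MAX = 2e-6, 20e-6  # tunable conductance range (Yao et al., 2020)

def weights_to_conductance(W_scaled):
    """Map scaled weights to differential conductance pairs (sketch).

    Each weight w is realized as G+ - G-, with both conductances kept
    inside [G_MIN, G_MAX]. The converting factor s_c maps the maximum
    |w| onto the available differential range G_MAX - G_MIN.
    """
    s_c = (G_MAX - G_MIN) / np.abs(W_scaled).max()
    g_diff = s_c * W_scaled                  # target G+ - G-
    g_pos = G_MIN + np.maximum(g_diff, 0.0)  # positive part on G+
    g_neg = G_MIN + np.maximum(-g_diff, 0.0) # negative part on G-
    return g_pos, g_neg, s_c

W = np.array([[0.8, -0.3], [0.1, -1.0]])
g_pos, g_neg, s_c = weights_to_conductance(W)
# Recover the weights as (g_pos - g_neg) / s_c during inference, which
# is why the amplification factor 1/s_c appears in the deployed model.
```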

Figure 2(b) shows a fully connected layer trained with the proposed BN fusing strategy. Deployment noise is applied to the fused weight; by introducing a factor γ/√(σ² + ϵ), we fuse the weight as:

ȳ_BN = y_BN − b_fused    (13)

Then, the other hardware restrictions are applied as illustrated in Figure 2(b). After training, the BN and the scale factor s are fused into the previous linear operation, so we have

W_deploy = s_c · Q(γ/√(σ² + ϵ) · W; bw) / s    (14)

b_deploy = Q(b_fused) / s    (15)

Figure 6: Schematic of a fully-connected layer for memristor neural network.

BN fusing strategies for quantization-aware training. Several BN fusing strategies have been proposed for quantization-aware training (see Figure 5 in the Appendix for details). By incorporating batch normalization into the linear operation (e.g., a fully connected or convolutional layer), these strategies convert W (i.e., the weight of the linear operation) into Wγ/√(σ² + ϵ) (i.e., the fused weight).

Table 1: All-layer output comparison of LeNet-5 between offline HRAT and hardware simulation.

lists the mean inference accuracy of VGG-16 on the CIFAR datasets when four weight noise levels (std=0, std=0.025, std=0.05, and std=0.075) are present. VGG-16 is either trained by HRAT alone or by HRAT followed by on-chip retraining. The optimal weight bitwidth is selected to report the highest inference accuracy for each case. The inference accuracy of HRAT drops slowly as the weight noise level increases. At a strong weight noise level (std=0.05), HRAT achieves a high accuracy of 87.24% on the CIFAR-10 dataset. If on-chip retraining is applied after HRAT, the inference accuracy is close to the baseline (i.e., weight noise std=0) results.

Performance comparison among various weight noise levels for VGG-16 on CIFAR datasets, assuming very high-resolution DACs are used for bias generation.

APPENDIX

A.1 WEIGHT QUANTIZATION

In this work, we uniformly quantize weights using the following quantization function:

Q(W; bw) = ∆ · clip(⌊W/∆⌉, −2^(bw−1), 2^(bw−1) − 1)    (1)

where ∆ is a trainable parameter of the quantization function denoting the quantization step size, ⌊·⌉ denotes the rounding operation, and bw is a given bitwidth. Due to the non-differentiability of the rounding operation, the weight and the quantization function cannot be trained through gradient-based optimizers. To address this issue, Eq. (1) is converted into a differentiable form:

Q(W; bw) = ∆ · (W/∆ + δ_rounding)

where the rounding noise δ_rounding = ⌊W/∆⌉ − W/∆. We assume that δ_rounding is independent noise, so the gradient of the non-differentiable δ_rounding is not calculated during back-propagation. This method is also known as the Straight-Through Estimator (STE) trick. The gradients of W and ∆ are thus obtained through the back-propagation algorithm, and we can train memristor neural networks with standard gradient-based optimizers.
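The STE trick above can be written compactly in PyTorch. The sketch below is illustrative; the clipping to the signed bw-bit range and all names are our assumptions.

```python
import torch

def ste_quantize(W, delta, bw):
    """Uniform quantization with the Straight-Through Estimator (STE).

    Forward: round W/delta to the nearest integer (clipped to the signed
    bw-bit range, an assumption of this sketch) and rescale by delta.
    Backward: the rounding noise is detached, so gradients flow straight
    through to both W and the trainable step size delta.
    """
    qmin, qmax = -(2 ** (bw - 1)), 2 ** (bw - 1) - 1
    scaled = torch.clamp(W / delta, qmin, qmax)
    rounded = scaled + (torch.round(scaled) - scaled).detach()  # STE trick
    return delta * rounded

torch.manual_seed(0)
W = torch.randn(4, 4, requires_grad=True)
delta = torch.tensor(0.1, requires_grad=True)
Wq = ste_quantize(W, delta, bw=4)
Wq.sum().backward()  # gradients reach both W and delta
```

Because the rounding residual is wrapped in `detach()`, the forward pass produces exact multiples of ∆ while the backward pass behaves as if Q were the identity inside the clipping range.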

A.2 EXISTING BN FUSING STRATEGIES

A linear layer with BN is expressed as:

y = W x + b    (3)

y_BN = γ · (y − µ_B)/√(σ²_B + ϵ) + β    (4)

where x is the input, W is the weight parameter, b is the bias term, γ is the learnable scale factor of BN, β is the bias term of BN, µ_B is the mean output of the current batch, and σ²_B is the output variance of the current batch. During inference, the MA statistics µ and σ² are used for normalization. For full-precision software models, BN is fused into the linear operation by combining (3) and (4). The inference can then be calculated with the fused parameters (fused weight and fused bias) as:

y_BN = W_fused x + b_fused

where the fused weight W_fused = γ/√(σ² + ϵ) · W and the fused bias b_fused = β + γ/√(σ² + ϵ) · (b − µ).
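The fused-parameter equations can be checked numerically. The following sketch verifies that the fused linear operation reproduces linear-plus-BN at the MA statistics; all tensors are synthetic.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Unfused reference: linear layer followed by BN using MA statistics.
W, b = torch.randn(8, 16), torch.randn(8)
gamma, beta = torch.rand(8) + 0.5, torch.randn(8)
mu, var, eps = torch.randn(8), torch.rand(8) + 0.1, 1e-5

x = torch.randn(32, 16)
y = F.linear(x, W, b)
y_bn = gamma * (y - mu) / torch.sqrt(var + eps) + beta

# Fused parameters, per the equations above:
#   W_fused = gamma / sqrt(var + eps) * W
#   b_fused = beta + gamma / sqrt(var + eps) * (b - mu)
scale = gamma / torch.sqrt(var + eps)
W_fused = scale[:, None] * W
b_fused = beta + scale * (b - mu)
y_fused = F.linear(x, W_fused, b_fused)

print(torch.allclose(y_bn, y_fused, atol=1e-4))  # True
```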

