NOISE INJECTION NODE REGULARIZATION FOR ROBUST LEARNING

Abstract

We introduce Noise Injection Node Regularization (NINR), a method of injecting structured noise into Deep Neural Networks (DNNs) during the training stage, resulting in an emergent regularizing effect. We present theoretical and empirical evidence for substantial improvement in robustness against various test data perturbations for feed-forward DNNs when trained under NINR. The novelty in our approach comes from the interplay of adaptive noise injection and initialization conditions such that noise is the dominant driver of dynamics at the start of training. As it simply requires the addition of external nodes without altering the existing network structure or optimization algorithms, this method can be easily incorporated into many standard architectures. We find improved stability against a number of data perturbations, including domain shifts, with the most dramatic improvement obtained for unstructured noise, where our technique in some cases outperforms existing methods such as Dropout or $L_2$ regularization. Further, desirable generalization properties on clean data are generally maintained.

1. INTRODUCTION

Nonlinear systems often display dynamical instabilities which enhance small initial perturbations and lead to cumulative behavior that deviates dramatically from a steady-state solution. Such instabilities are prevalent across physical systems, from hydrodynamic turbulence to atomic bombs (see Jeans & Darwin (1902); Parker (1958); Chandrasekhar (1961); Drazin & Reid (2004); Strogatz (2018) for just a few examples). In the context of deep learning (DL), DNNs, once optimized via stochastic gradient descent (SGD), suffer from similar instabilities as a function of their inputs. While remarkably successful in a multitude of real-world tasks, DNNs are often surprisingly vulnerable to perturbations in their input data as a result (Szegedy et al., 2014). Concretely, after training, even small changes to the inputs at deployment can result in total predictive breakdown.

One may classify such perturbations with respect to the distribution from which training data is implicitly drawn. This data is typically assumed to have support over (the vicinity of) some low-dimensional submanifold of potential inputs, which is only learned approximately due to the discrete nature of the training set. To perform well during training, a network need only have well-defined behavior on the data manifold, accomplished through training on a given data distribution. However, data seen on deployment can display other differences with respect to the training set, as illustrated in Fig. 1. These distortions introduce vulnerabilities that are a crucial drawback of trained DNNs, making them susceptible to commonly occurring noise which is ubiquitous in real-world tasks. By studying how networks dynamically act to mitigate the negative effects of input noise, we identify a novel dynamical regularization method starting in a noise-dominated regime, leading to more robust behavior for a range of data perturbations. This is the central contribution of this work.
Background: Regularization involves introducing additional constraints in order to solve an ill-posed problem or to prevent overfitting. In the context of DL problems, different regularization schemes have been proposed (for a review, see Kukačka et al. (2018) and references therein). These methods are designed to constrain the network parameters during training, thereby reducing sensitivity to irrelevant features in the input data, as well as avoiding overfitting. For instance, weight norm regularization ($L_2$, $L_1$, etc.) (Cortes & Vapnik, 1995; Zheng et al., 2003) can be used to reduce overfitting to the training data, and is often found to improve generalization performance (Hinton, 1987; Krogh & Hertz, 1991; Zhang et al., 2018). Alternatively, introducing stochasticity during training (e.g., Dropout (Srivastava et al., 2014)) has become a standard addition to many DNN architectures, for similar reasons. These methods are mostly optimized to reduce the generalization error from training to test data, under the assumption that both are sampled from the same underlying distribution (Srivastava et al., 2014). Here, we propose a new method which is instead tailored for robustness. Our method relies on noise injection, which actively reduces the sensitivity to uncorrelated input perturbations.

Our contribution: In this paper, we employ Noise Injection Nodes (NINs), which feed random noise through designated optimizable weights, forcing the network to adapt to layer inputs which contain no useful information. Since the amount of injected noise is a free parameter, at initialization we can set it to be anything from a minor perturbation to the dominant effect, leading to a system breakdown for extreme values. The general behavior of NINs and how they probe the network is the main focus of Levi et al. (2022), while we focus here on their regularizing properties in different noise injection regimes. The results of Levi et al. (2022) are explicitly recast in the context of regularization in App. C for a linear model, which captures the main insights. Our study suggests that within a certain range of noise injection parameter values, this procedure can substantially improve robustness against subsequent input corruption and partially against other forms of distributional shifts. The maximal improvement occurs for large noise injection magnitudes approaching the upper boundary of this window, above which the training accuracy degrades to random guessing. To the best of our knowledge, this regime has not been previously explored.

In the following, we analyze how the addition of NINs produces a regularization scheme which we call Noise Injection Node Regularization (NINR). The main features of NINR are enhanced stability, simplicity, and flexibility, without drastically compromising generalization performance. In order to demonstrate these features, we consider two types of feed-forward architectures, Fully Connected Networks (FCs) and Convolutional Neural Networks (CNNs), and use various datasets to train the systems. We compare the robustness improvement of NINR with standard regularization methods, as well as with the performance of these systems when using input corruption during training (CDT). Our results can easily be generalized to other architectures and more complex NINR topologies. In Fig. 2, we present our main results, comparing networks trained on the FMNIST dataset and demonstrating improved robustness against input perturbations without compromising generalization on clean data.

Figure 2: Test accuracy vs. the scale of input noise corruption defined in Eq. (7) is shown for $L_2$, Dropout, in-NINR, full-NINR and CDT with $\sigma_{noise} = 0.4$. Shaded bands indicate 2 standard deviations estimated over 10 distinct runs. For the input-NINR (full-NINR) fully-connected implementations we take $\sigma_\epsilon = 51.8$ ($16.4$) in the decay phase and $\sigma_\epsilon = 231.6$ ($51.8$) in the catapult phase. Similarly, for the convolutional implementations we take $\sigma_\epsilon = 2.8$ ($0.9$) in the decay phase and $\sigma_\epsilon = 87.5$ ($62$) in the catapult phase. This key result illustrates that NINR significantly increases the robustness of generic architectures trained on the FMNIST dataset, while only marginally affecting generalization ($\sigma_{noise} = 0$). Comparing the CDT and input-NINR curves demonstrates the advantage of our regularization method. While both techniques perform similarly well on data corruption of $\sigma_{noise} = 0.4$, CDT is significantly worse on clean data. This is a result of the CDT network being forced to fit both noise and data, without the ability to suppress the noise, a crucial attribute of NINR. Here, the learning rate is fixed to $\eta = 0.05$ with mini-batch size $B = 128$.

The paper is organized as follows. In Sec. 2 we briefly review important analytical and empirical results that are explored in depth in the work of Levi et al. (2022), and demonstrate how NINs implicitly generate adaptive regularization terms in the loss function. In Sec. 3, we empirically study the effectiveness of NINR. We begin by evaluating its effect on robustness against perturbations, including domain shifts and adversarially designed perturbations, demonstrating the enhanced performance of NINR. We then verify that generalization performance on clean data is not hindered by training with NINR. We discuss related work in Sec. 4, and conclude in Sec. 5.

2. NOISE INJECTION NODES REGULARIZATION

In the following sections we explain how an effective regularization scheme against input corruption naturally emerges as a consequence of adding a NIN to a DNN. First, we discuss how the NIN generates implicit regularization terms directly from computing the effective loss function. Then, we review the adaptive nature of these terms as they relate to Noise Injection Weight (NIW) dynamics during training, and discuss the expected robustness gains depending on the evolution of the NIWs.

2.1. EMERGENT REGULARIZATION TERMS

In order to see how NINs generate implicit regularization terms, we study a vanilla feed-forward DNN setup. Consider a supervised learning problem modeled by a neural network optimized under SGD, with an associated single-sample loss function, $L : \mathbb{R}^{d_{in}} \to \mathbb{R}$. The loss depends on the model parameters $\theta = \{W^{(\ell)}, b^{(\ell)} \mid \ell = 0, \ldots, N_L - 1\}$, where $N_L$ is the number of layers, and the weights and biases associated with a given layer are $W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell+1}}$, $b^{(\ell)} \in \mathbb{R}^{d_{\ell+1}}$. At each SGD iteration, a mini-batch $B$ consists of a set of labeled examples, $\{(x_i, y_i)\}_{i=1}^{|B|} \in \mathbb{R}^{d_{in}} \times \mathbb{R}^{d_{label}}$. The addition of a NIN in a given layer, $\ell_{NI}$, corresponds to a random scalar input, $\epsilon \in \mathbb{R}$, sampled repeatedly for each SGD training epoch from a chosen distribution, connected via NIWs $W_{NI} \in \mathbb{R}^{1 \times d_{\ell_{NI}+1}}$. We define, for a given layer $\ell$, the preactivation $z^{(\ell)} = W^{(\ell)} x^{(\ell)} + b^{(\ell)}$. The addition of a NIN to a dense layer therefore results in a translation of the preactivation at $\ell_{NI}$, namely $z^{(\ell_{NI})} \to z^{(\ell_{NI})} + \epsilon W_{NI}$. The batch-averaged loss function including a NIN can be written as a series expansion in the noise translation parameter $\epsilon W_{NI}$,

$$L(\theta, W_{NI}) = \frac{1}{|B|} \sum_{\{x,y,\epsilon\} \in B} L(\theta, W_{NI}; x, \epsilon, y) = \frac{1}{|B|} \sum_{\{x,y,\epsilon\} \in B} e^{\epsilon W_{NI}^T \nabla_{z^{(\ell_{NI})}}} L(\theta; x, y). \tag{1}$$

Equation (1) follows from noting that the NIN-induced translation can be written as an operator. For further details see App. B. Expanding in the parameter $\epsilon W_{NI}$, we obtain an infinite series given by

$$L(\theta, W_{NI}) = L(\theta) + \sum_{k=1}^{\infty} R_k(\theta, W_{NI}). \tag{2}$$

Here, $L(\theta)$ is the loss function in the absence of any NIN, while the $R_k$ are batch-averaged derivatives of the loss function with respect to the preactivations at the noise-injected layer,

$$R_k(\theta, W_{NI}) \equiv \frac{1}{|B|} \sum_{\{x,y,\epsilon\} \in B} \frac{\left(\epsilon W_{NI}^T \cdot \nabla_{z^{(\ell_{NI})}}\right)^k}{k!} L(\theta; x, y). \tag{3}$$

These functions are products of the moments of the injected noise, the values of the NIWs themselves, and preactivation derivatives of the loss function in the absence of injected noise.
It is impossible to estimate when a perturbative analysis in $\epsilon$ is valid without specifying $L(\theta; x, y)$, as all $R_k$ may become equally important, or the series itself may not converge. Furthermore, since we will be interested in rather large values of $\epsilon$, where the effect of higher $R_k$ terms is noticeable, the validity of the perturbative calculation is called into question even further. However, in order to gain intuition, we first study how the training procedure is altered by the NIN in the limit of small $\epsilon \ll 1$. To make further progress, we will later validate our analysis below using a combination of empirical tests and an investigation of a linear toy model where $R_k = 0$ for $k > 2$. For sufficiently small $\epsilon$ and analytic activation and loss functions, the series converges and the full loss is well approximated by the first two leading terms in $\epsilon$. For the rest of this work, we consider noise sampled from a distribution with zero mean, relaxing this assumption only for some empirical results in App. D.2. Under this assumption the first two terms can be cast into simple forms,

$$R_1 = W_{NI}^T \cdot \langle \epsilon\, g_{\ell_{NI}} \rangle, \qquad R_2 = \frac{1}{2} W_{NI}^T \langle \epsilon^2 H_{\ell_{NI}} \rangle W_{NI}. \tag{4}$$

Here, batch averaging is denoted by $\langle \cdots \rangle$, while $g_{\ell_{NI}} = \nabla_{z^{(\ell_{NI})}} L(\theta; x, y)$ and $H_{\ell_{NI}} = \nabla_{z^{(\ell_{NI})}} \nabla^T_{z^{(\ell_{NI})}} L(\theta; x, y)$ are the network-dependent local gradient and local Hessian, respectively. As proven in Levi et al. (2022), the magnitude of $R_1$ can then be estimated using $\langle \epsilon\, g_{\ell_{NI}} \rangle^2 \sim \sigma_\epsilon^2 \langle g_{\ell_{NI}}^2 \rangle / |B|$, where $\langle g_{\ell_{NI}}^2 \rangle$ is the vector of the batch-averaged squared values of the local gradients and $\sigma_\epsilon^2$ is the variance of the injected noise, while $R_2$ may be estimated using $\langle \epsilon^2 H_{\ell_{NI}} \rangle \approx \sigma_\epsilon^2 \langle H_{\ell_{NI}} \rangle$ up to corrections scaling as $O(1/\sqrt{|B|})$. While $R_1$ may take both positive and negative values, the sign of $H_{\ell_{NI}}$ depends on the network architecture. Since the spectrum of the local Hessian is generally unknown, our analytical results are only valid for certain limiting cases.
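The scaling of $R_1$ with batch size quoted above can be motivated in a few lines. The following is only a sketch, assuming that the noise samples $\epsilon_i$ within a batch are i.i.d. with zero mean and variance $\sigma_\epsilon^2$, and that they are independent of the local gradients $g_{i,j}$ (the index $i$ labels samples and $j$ labels components; this index notation is introduced here for illustration):

```latex
\langle \epsilon\, g_{\ell_{NI}} \rangle_j
  = \frac{1}{|B|} \sum_{i=1}^{|B|} \epsilon_i\, g_{i,j},
\qquad
\mathbb{E}_\epsilon\!\left[ \langle \epsilon\, g_{\ell_{NI}} \rangle_j^2 \right]
  = \frac{1}{|B|^2} \sum_{i,i'} \mathbb{E}[\epsilon_i \epsilon_{i'}]\, g_{i,j}\, g_{i',j}
  = \frac{\sigma_\epsilon^2}{|B|^2} \sum_{i} g_{i,j}^2
  = \frac{\sigma_\epsilon^2}{|B|}\, \langle g_{\ell_{NI}}^2 \rangle_j .
```

Under these assumptions the typical size of $R_1$ grows linearly in $\sigma_\epsilon$ and falls off as $1/\sqrt{|B|}$, whereas $R_2$ grows as $\sigma_\epsilon^2$ with no batch-size suppression.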
In particular, we focus on the case of Mean Squared Error (MSE) loss and linear activation functions, where we find that the local Hessian $H$ is a positive semi-definite (PSD) matrix, implying that $R_2$ is a strictly non-negative penalty term, and $R_k$ terms with $k > 2$ vanish identically. For the well-motivated case of piecewise linear activations, it was shown in Botev et al. (2017) that the local Hessian is PSD, aside from non-analytic points, hinting that $R_2$ acts as a regularizer for these networks as well. This implies that for networks with piecewise linear activations and MSE loss, an analysis similar to ours below, which keeps only the first two terms in the expansion of Eq. (2), is expected to hold not only for small $\epsilon$, but also for large values. We will use this construction to understand how these terms evolve during training in the next section.

The interpretation of $R_1$ and $R_2$ can now be made clear: $R_1$ induces a constrained random walk in the norm of the noise injection weights as well as for the data weights at layers $\ell > \ell_{NI}$, with a step size that changes according to the local gradient during training. On the other hand, $R_2$, which does not depend on $|B|$, can be understood as a straightforward regularization term for the local Hessian, working to reduce its eigenvalues. These results imply that in the limit of large batch size, and in particular full-batch SGD (i.e., gradient descent), regularization via $R_2$ is dominant.

Further understanding of why pushing the local Hessian to smaller eigenvalues is expected to reduce the sensitivity to noise corruption comes from looking at the loss for corrupted inputs. Consider therefore a network without a NIN but with corrupted inputs, described by the substitution $x \to x + \delta$, with $\delta$ a random vector. To arrive at expressions similar to Eqs. (1) to (4), one can transform the preactivations $z^{(0)} \to z^{(0)} + W^{(0)} \delta$ to obtain

$$L(\theta)\big|_{x \to x + \delta} = \frac{1}{|B|} \sum_{\{x,y\} \in B} e^{\delta^T W^{(0)} \cdot \nabla_{z^{(0)}}} L(\theta; x, y). \tag{5}$$

Under the assumption that the components of the vector $\delta$ are drawn i.i.d. from $N(0, \sigma_\delta^2)$, the first two terms above assume simple forms, similar to Eq. (4),

$$R_1 = \langle \delta^T W^{(0)} \cdot g_0 \rangle, \qquad R_2 = \frac{1}{2} \sigma_\delta^2 \, \mathrm{Tr}\left[ (W^{(0)})^T \langle H_0 \rangle W^{(0)} \right]. \tag{6}$$

As before, for sufficiently large $|B|$, $R_1$ is subdominant and the main regularization term due to the noise is dictated by $H_0$. Thus, if a NIN is inserted at the first layer, it will act to reduce $H_0$ and thereby reduce the sensitivity to data corruption. Furthermore, since the DNN structure in general, and the loss function in particular, couples the input layer to all succeeding layers, $H_0$ contains information about deeper layers and will benefit from reducing the local Hessian away from the input layer. In Sec. 3 we show results for NINs coupled to the input layer or to all layers. The above readily generalizes in this setup, resulting in multiple emergent regularization terms, which we briefly discuss in App. B.

Despite the similarities in their descriptions, we stress that a system trained on corrupted data and a system with a NIN are not the same. In the former, the noise cannot be dynamically reduced without dramatically altering the optimization trajectory, implying that the DNN is not expressive enough to memorize the full data information (Ziyin et al., 2022). Conversely, in the latter, the noise has its own weights and the system can therefore improve by suppressing them without harming generalization. Nonetheless, both systems are driven towards regions with smaller local Hessian eigenvalues.
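As a numerical illustration of the statement that $R_k$ vanishes for $k > 2$ with MSE loss and linear activations, the following NumPy sketch checks that the loss with a NIN equals $L + R_1 + R_2$ exactly. It is not the authors' code: the two-layer linear network, the row-vector convention (`Z = X @ W0`), and all variable names and sizes are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_h, d_out = 64, 5, 4, 3           # hypothetical sizes

X = rng.normal(size=(B, d_in))              # inputs
Y = rng.normal(size=(B, d_out))             # regression targets
W0 = rng.normal(size=(d_in, d_h)) / np.sqrt(d_in)   # data weights into the NIN layer
W1 = rng.normal(size=(d_h, d_out)) / np.sqrt(d_h)   # weights after the NIN layer
w_ni = rng.normal(size=d_h)                 # noise injection weights (NIWs)
sigma_eps = 3.0
eps = rng.normal(scale=sigma_eps, size=B)   # one scalar noise sample per example


def batch_loss(Z):
    """Batch-averaged MSE loss as a function of the preactivations Z."""
    return 0.5 * np.mean(np.sum((Z @ W1 - Y) ** 2, axis=1))


Z = X @ W0                                   # preactivations without the NIN
loss_clean = batch_loss(Z)
loss_nin = batch_loss(Z + eps[:, None] * w_ni[None, :])   # z -> z + eps * W_NI

g = (Z @ W1 - Y) @ W1.T                      # local gradient per sample, shape (B, d_h)
H = W1 @ W1.T                                # local Hessian (same for every sample)

R1 = w_ni @ np.mean(eps[:, None] * g, axis=0)
R2 = 0.5 * np.mean(eps ** 2) * (w_ni @ H @ w_ni)

print(loss_nin, loss_clean + R1 + R2)
```

In this linear setting the local Hessian `W1 @ W1.T` is PSD by construction, so `R2` is non-negative, while `R1` can take either sign depending on the noise draw.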

2.2. EVOLUTION OF NOISE INJECTION WEIGHTS

The dynamical nature of NINR and the corresponding NIWs strongly depends on the noise distribution, parameterized in this study by $\sigma_\epsilon$. While the NIWs are updated with each learning step, only under certain conditions is their impact on the network performance actively suppressed as the training progresses. Below we briefly describe four distinct phases of the NIWs. These are illustrated in Fig. 3, where we show the evolution of the relevant quantities (weights, loss, accuracy) for a model trained on FMNIST, demonstrating the different behavior in each phase. A complete treatment of these phases is given in Levi et al. (2022), while a brief derivation relating them to regularization is given in App. C for a linear network.

Decoupled phase. For $\sigma_\epsilon \ll 1$ one has $R_1 \gg R_2$, and the correction to the loss function may take either sign. As a consequence, the NIWs follow a small-step random walk without substantially affecting the behavior of the network.

Decay phase. For larger but not too large $\sigma_\epsilon$, one may ensure $R_2 > R_1$ at initialization while the NIN can still be treated perturbatively. In this regime, the NIWs initially experience exponential decay until $R_2 \sim R_1$, at which point they evolve according to the stochastic gradient. It is in this phase that one can begin to see noticeable improvement in robustness, with only minor slowing of the training. Increasing $\sigma_\epsilon$ boosts the improvement until another phase is encountered.

Catapult phase. The discrete nature of the training algorithm will result in a stiff numerical regime at sufficiently large $\sigma_\epsilon$. Above a critical value (for the linear network discussed in App. C we find $\sigma_{\epsilon,cat} \sim \sqrt{2 d_{\ell_{NI}} / \eta}$, where $d_{\ell_{NI}}$ is the dimension of the NIN layer and $\eta$ is the learning rate), the effect of the NIN on the network is so significant that it causes an initial increase of the data weights, which in turn leads to an exponential increase of the loss function, followed by a recovery to a new minimum. The improvement in robustness is most extreme in this phase. However, the convergence of the network is slowed somewhat, limiting the usefulness of this phase to some applications. It is possible that a scheduled increase of the learning rate after the recovery from the initial increase in the data weights could speed up the convergence. We leave such investigation to future work.

Divergent phase. Further increasing $\sigma_\epsilon$ leads the DNN to a breakdown of the dynamics, where the network is unable to suppress the NIN and thus cannot learn any information.

The above discussion of phases as a function of $\sigma_\epsilon$ should be taken as schematic. Other hyperparameters, such as the batch size, may also influence the phase diagram. Nonetheless, we empirically observe these phases repeating across multiple architectures and tasks, and find them to broadly capture the evolution of the NIWs. Overall, the decay and catapult phases are expected to produce an increase in robustness against input perturbations, and we empirically verify this expectation in the following sections. While we only have an analytic prediction of $\sigma_{\epsilon,cat}$ for a simple linear network, in other architectures it can also be obtained empirically using only the training data.

3. EXPERIMENTS

In this section, we empirically show the effect of NINR on robustness for the different phases of noise injection, following methodologies similar to those of Hoffman et al. (2019). After discussing the two different architectures used in this paper, we begin our investigation by demonstrating that in certain cases NINR provides a significant increase in robustness against corruption of input data by random perturbations. We then discuss the performance of NINR for domain shifts, demonstrating its effectiveness. Next, we verify that NINR does not drastically reduce the network accuracy at the original task (i.e., before corruption). This is equivalent to ensuring that the generalization properties of the network are not harmed by the addition of NINs. In the main text we present results mostly for the FMNIST dataset (Xiao et al., 2017). These results also extend to more complex scenarios, demonstrated in similar experiments for the CIFAR-10 (Krizhevsky et al., 2014) dataset in App. E, while App. D gives evidence for improvement against adversarial attacks, as well as results for other noise distributions and optimizers beyond SGD.

Throughout this section, we compare NINR to both unregularized DNNs and networks explicitly regularized using $L_2$ or Dropout. We also compare NINR to implicit regularization by training with varying amounts of input data corruption. For all of our experiments, we use either an FC or a CNN (see Fig. 2 and App. A for full details). We optimize using vanilla SGD with cross-entropy loss. We preprocess the data by subtracting the mean and dividing by the variance of the training data, as is done for all subsequent datasets. The learning rate is fixed to $\eta = 0.05$ with mini-batch size $B = 128$. The hyperparameters are chosen to match reference implementations: the $L_2$ regularization coefficient (weight decay) is set to $\lambda_{WD} = 5 \cdot 10^{-4}$ and the dropout rate is set to $p_{drop} = 0.5$. When using $L_2$ or Dropout, they are applied at/after each layer.
When using input CDT as a regularization method, we corrupt the input data according to Eq. (7) below. We further stress that once a NIN has been added to the network, no further modifications to the training algorithm or architecture are required, and after choosing where to connect the NIN, the only free parameter is the injected noise variance $\sigma_\epsilon^2$.
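To illustrate how little is needed to attach a NIN to a dense layer, the following NumPy sketch implements the preactivation translation $z \to z + \epsilon W_{NI}$ from Sec. 2.1 and checks that it coincides with appending one noisy input and one extra row of weights. The function names and the row-vector convention (`x @ W + b`) are assumptions chosen for this example, not part of the original method description.

```python
import numpy as np


def dense_with_nin(x, W, b, w_ni, eps):
    """Dense layer whose preactivation is translated by eps * w_ni.

    x: (batch, d_in), W: (d_in, d_out), b: (d_out,),
    w_ni: (d_out,) noise injection weights, eps: (batch,) noise samples.
    """
    return x @ W + b + eps[:, None] * w_ni[None, :]


def dense_with_noise_input(x, W, b, w_ni, eps):
    """Same layer, written as an ordinary dense layer on an extended input."""
    x_aug = np.concatenate([x, eps[:, None]], axis=1)   # append the noisy input
    W_aug = np.vstack([W, w_ni[None, :]])               # append the NIW row
    return x_aug @ W_aug + b


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))
W = rng.normal(size=(5, 3))
b = rng.normal(size=3)
w_ni = rng.normal(size=3)
sigma_eps = 2.0                                          # the only new hyperparameter
eps = rng.normal(scale=sigma_eps, size=8)

z_translated = dense_with_nin(x, W, b, w_ni, eps)
z_augmented = dense_with_noise_input(x, W, b, w_ni, eps)
```

Because the second form is just a dense layer with one more input, `w_ni` can be trained by the same optimizer as the data weights, and setting `w_ni` to zero recovers the original layer.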

3.1. REALIZATIONS IN DIFFERENT ARCHITECTURES

The way in which NINR is implemented depends on the type of layer to which the NIN is connected. Here, we comment on the two different realizations of NINR used in our experiments and depicted in Fig. 4. For both realizations we consider two distinct topologies: either we add a NIN at the input layer (in-NINR) or we couple the NIN to every hidden layer including the input one (full-NINR).

Fully Connected Layers. In the case of dense FC layers, we implement NINR (which we denote by fcNINR) by extending the input vector by an additional noisy pixel $\epsilon$, initialized randomly per sample at each training epoch, and densely connecting the modified input vector to the next layer. The theoretical discussion in Sec. 2 was derived for a realization of this type.

Convolutional Layers. Connecting a NIN at the input of a convolutional layer raises the need for a procedure for Convolutional NINR (cNINR). Since FC layers are insensitive to the input image geometry, taking $x \to \{x, \epsilon\}$ is tantamount to adding a noise mask for the entire input. In a CNN, the same interpretation can be maintained by adding the noise to the input directly in a pixel-wise fashion, $x^{(\ell)} \to x^{(\ell)} + W_{NI} \cdot \epsilon$, which is subsequently fed into the convolutional layer. Importantly, this modification preserves the form of the original layer, while converging to the original $x^{(\ell)}$ for either $\sigma_\epsilon \to 0$ or $\|W_{NI}\| \to 0$. This can also be thought of as adding an auxiliary layer that is a non-dynamic identity matrix from the perspective of all data weights, while being densely connected from the perspective of the NIN.

3.2. ROBUSTNESS AGAINST DISTRIBUTIONAL SHIFTS

3.2.1. INPUT CORRUPTION

Often, training is done with examples taken in ideal conditions, which would not always exist in real-world data. This implies that the test data would be sampled from a distribution that is identical to the one trained on, albeit with an added noise component. To test the stability of networks against natural corruption, we perturb each test input image according to

$$x_i \to \sqrt{1 - \sigma_{noise}^2}\; x_i + \sigma_{noise}\, \delta_i, \tag{7}$$

where each component of the perturbation vector $\delta_i$ is drawn from $N(0, 1)$. In all cases except CDT, the networks are trained using clean FMNIST training data, but their accuracy is evaluated on corrupted FMNIST testing data.

In Fig. 2, we demonstrate that models trained with NINR are more robust against the noise defined above compared to those trained via other regularization methods. This verifies our expectation that in the decay phase, NINR improves stability, at least as well as CDT for large corruption, while (unlike CDT) it does not degrade generalization performance for small corruption or clean data. We also note that noise injection offers the best results for robustness within the catapult regime. However, to arrive at the same accuracy on the clean test dataset, more training epochs are generally required. Lastly, we empirically observe that in-NINR in the catapult phase offers the greatest improvement in stability against input corruption for both networks, as in-NINR most closely resembles input data corruption. We repeat this experiment for the CIFAR-10 dataset in App. E, where similar trends persist, though at the cost of a longer training time in the catapult phase.
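The test-time corruption described above can be sketched as follows. This is an illustrative NumPy version only: it assumes the variance-preserving reading of Eq. (7), $x \to \sqrt{1 - \sigma_{noise}^2}\, x + \sigma_{noise}\, \delta$ with $\delta \sim N(0, 1)$ per component, and the function name `corrupt_inputs` is hypothetical.

```python
import numpy as np


def corrupt_inputs(x, sigma_noise, rng):
    """Mix standardized inputs with unit Gaussian noise of scale sigma_noise.

    Assumes 0 <= sigma_noise <= 1 so that the square root is real.
    """
    if not 0.0 <= sigma_noise <= 1.0:
        raise ValueError("sigma_noise must lie in [0, 1]")
    delta = rng.standard_normal(x.shape)     # perturbation, one draw per pixel
    return np.sqrt(1.0 - sigma_noise ** 2) * x + sigma_noise * delta


rng = np.random.default_rng(0)
x_test = rng.standard_normal((1000, 784))    # stand-in for standardized test images
x_clean = corrupt_inputs(x_test, 0.0, rng)   # sigma_noise = 0 leaves the data unchanged
x_noisy = corrupt_inputs(x_test, 0.4, rng)   # the CDT reference scale used in Fig. 2
```

With this form, inputs that were standardized to unit variance keep approximately unit variance at every corruption scale, so accuracy curves at different values of `sigma_noise` are compared at a fixed overall input scale.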

3.2.2. DOMAIN SHIFT

Another test of the generalization properties induced by NINR can be realized by considering domain shift problems. Here, we consider the generalization between two different datasets, representing different marginal distributions, by training models with NINR on the MNIST dataset and testing their performance on data drawn from a new target domain distribution: the USPS test set (Hull, 1994). In order to match the input dimensions of the MNIST data, we follow the original rescaling and centering done in LeCun et al. (1998). The USPS images were size-normalized to fit a 20 × 20 pixel box while preserving their aspect ratio, and then centered in a 28 × 28 image field, followed by the standard preprocessing procedure. The results are presented in Table 1. We observe generalization improvement for both FC and convolutional networks when using different regularization schemes, compared to unregularized networks. Of particular interest are the gains obtained when implementing both in- and full-NINR in the catapult phase with the convolutional network. As the architecture becomes more complex, the improvements from NINR's adaptive scheme become more pronounced. The enhanced performance implies that NINR in the catapult phase could prove very beneficial for domain adaptation tasks. This is not entirely surprising, as the USPS dataset is expected to lie close to the MNIST training set in distribution space, since there are no new correlated features such as several different digits in one image. Further experiments for domain shift adaptation on the MNIST-C dataset can be found in App. F. For other datasets, with large distributional shifts away from MNIST and novel input correlations, we do not expect NINR to generically outperform other regularization methods.

3.2.3. GENERALIZATION TO TEST DATA

In this section we report some effects of NINR on generalization from training to test data. While we showed above that NINR can substantially improve network performance on corrupted data, it is also important that it does not fundamentally impair the network's generalization properties. Generically, introducing input corruption during training to increase robustness can be shown to have a negative effect on generalization on clean data. This is unsurprising, as it appears that the network essentially memorizes the noise (Zhang et al., 2016), which is clearly not part of the true data distribution. As the learning process with NINR inherently leads to a suppression of the noise during the late stages, its generalization capabilities are expected to be far less affected. We verify this by comparing the performance of a network trained with NINR against networks trained with $L_2$, Dropout, CDT, and against unregularized DNNs. In Table 2 we show generalization performance on the FMNIST test set for the FC and CNN architectures using the full dataset, consisting of 60 000 training examples with a 60/40 training/validation split. Our main observation is that optimizing with NINR in the decay phase, as with the commonly used $L_2$ and Dropout regularizers, leads to performance on clean data with in-domain test samples at least as good as the unregularized case. (In no case is the performance better at a statistically significant level.) We note that some degradation occurs when training with noise injection in the catapult phase, for a fixed number of training epochs. This degradation can be ameliorated by training for a longer period. Contrasting NINR with CDT, Table 2 clearly demonstrates that generalization is compromised for the latter, as the network cannot distinguish data from noise and learns the corrupted distribution. We further verify these results for CIFAR-10 in App. E.

4. RELATED WORK

Noise injection during training as a method of enhancing robustness has been proposed in various configurations in the literature. These include adding noise to input data (Hendrycks et al., 2019; Gao et al., 2020; Liu et al., 2021), activations, outputs, weights, gradients (Holmström & Koistinen, 1992; Reed & Marks, 1999; Neelakantan et al., 2015; You et al., 2019) and more. Most studies keep the amount of injected noise fixed, while we allow the network to reduce its effect during training. Our study expands upon these works, consolidating empirical evidence with analytical insights. Our main contributions are twofold. First, we provide analytic expressions for the implicit regularization terms generated within our scheme, and estimate their effects during training. When applied to specific architectures, this allows us to predict when NINR is expected to be most effective. Second, we probe a novel phase of learning, starting with a large amount of noise injection and leading to a greater improvement in robustness against input corruption.

5. CONCLUSIONS

In this paper, we motivated Noise Injection Node Regularization as a task-agnostic method to improve stability of models against perturbations to input data. Our method is straightforward to implement in any open-source automatic differentiation system. While we restricted this initial study to a single Noise Injection Node added to various layers, with a fixed scale of noise injection during training, this restriction can be relaxed, leading to potential improvements to NINR. For instance, changing the amount of injected noise during training, similar to learning rate scheduling, could aid in convergence speed while still obtaining the advantages of a large amount of noise injection.

A NETWORK ARCHITECTURE DETAILS

Here we describe the experimental settings specific to each of the figures in the paper. All the models have been trained with cross-entropy loss unless otherwise specified.

$$L(\theta, W_{NI}) = \frac{1}{|B|} \sum_{\{x,\epsilon,y\} \in B} L(\theta, W_{NI}; x, \epsilon, y) = \frac{1}{|B|} \sum_{\{x,\epsilon,y\} \in B} e^{\epsilon W_{NI}^T \nabla_{z^{(\ell_{NI})}}} L(\theta; x, y).$$

Expanding in powers of $\epsilon W_{NI}$, we obtain an infinite series given by

$$L(\theta, W_{NI}) = L(\theta) + \frac{1}{|B|} \sum_{\{x,\epsilon,y\} \in B} \sum_{k=1}^{\infty} \frac{1}{k!} \left(\epsilon W_{NI}^T \cdot \nabla_{z^{(\ell_{NI})}}\right)^k L(\theta; x, y),$$

where we identify the $k \geq 1$ terms in the expansion with the implicit regularization terms defined in Eq. (3).

B.2 NIN AT ALL LAYERS

Here we extend our theoretical derivations from the case of a NIN connected to a single layer to a single NIN connected to all layers (the full-NINR case). Similar to Eq. (1), we may write down the loss using the translation operator,

$$L(\theta, W_{NI}) = \frac{1}{|B|} \sum_{\{x,y,\epsilon\}\in B} \prod_{\ell=0}^{N_L-1} e^{\epsilon (W^{(\ell)}_{NI})^T \nabla_{z^{(\ell)}}} L(\theta; x, y),$$

where, as one may expect, we now have $N_L$ vectors $W^{(0)}_{NI}, \ldots, W^{(N_L-1)}_{NI}$, of respective dimensions $\mathbb{R}^{1\times d_1}, \ldots, \mathbb{R}^{1\times d_{N_L}}$.

Published as a conference paper at ICLR 2023

Focusing once more on the leading terms in $\epsilon$, we can see that the first order regularization term is simply

$$R_1 = \sum_{\ell=0}^{N_L-1} (W^{(\ell)}_{NI})^T \cdot \langle \epsilon g_\ell \rangle,$$

which is the sum of the regularization terms at each layer. The second order regularization is slightly more complex, and may be written as

$$R_2 = \frac{1}{2} \sum_{\ell_1=0}^{N_L-1} \sum_{\ell_2=0}^{N_L-1} (W^{(\ell_1)}_{NI})^T \langle \epsilon^2 H_{\ell_1\ell_2} \rangle W^{(\ell_2)}_{NI}, \qquad H_{\ell_1\ell_2} \equiv \nabla_{z^{(\ell_1)}} \nabla^T_{z^{(\ell_2)}} L(\theta; x, y).$$

According to Botev et al. (2017); Martens & Grosse (2015), the terms which mix different layers in the Hessian are expected to be small, leaving a sum over the single-layer $R_2$ terms of each layer $\ell$. Therefore, we may use the same arguments used in the main text to estimate the scaling of the two terms with $\sigma_\epsilon$ and $|B|$. Much like the single-layer case, we expect $R_1 \propto \sigma_\epsilon/\sqrt{|B|}$ and $R_2 \propto \sigma_\epsilon^2$. Demonstrating the emergence of the phases discussed in the main text (i.e., the decay phase, the catapult phase, etc.) on a deep linear network for this full-NINR case is beyond the scope of this work. However, we do note that empirically they are found to be present, much like in the single-layer NIN case.

C NINR IN A LINEAR TOY MODEL

In order to elucidate the interpretation of noise injection nodes as an emergent regularization scheme, combined with a form of constrained random walk, we employ an (over)simplified univariate linear model which captures the main features present in realistic networks.

Consider a linear network (i.e., linear activation functions), with a single hidden layer and no biases ($b = 0$), aiming to perform a linear regression task. The data consists of a set of training samples $\{(x_a, y_a) \in \mathbb{R}^2\}_{a=1}^{m}$, taken by drawing the inputs from a normal distribution $x_a \in X \sim \mathcal{N}(0, \sigma_x^2)$, where $\sigma_x > 0$. The corresponding outputs are then given by a linear transformation $y_a = M \cdot x_a$ with $M \in \mathbb{R}$. The noise node is added at the input, and its weight is $w_{NI}$, with the input's data weight being $w^{(0)}$. The hidden layer is directly connected to the output, and has a single weight associated with it, $w^{(1)}$, as illustrated in Fig. 5. We use Mean Squared Error (MSE) loss, and for simplicity take full-batch gradient descent, and thus our loss function is

$$L_{MSE} = \frac{1}{2|B|} \sum_{a\in B} \left[ w^{(1)} \left( w^{(0)} \cdot x_a + w_{NI}\,\epsilon_a \right) - y_a \right]^2.$$

Performing an explicit averaging, we can further simplify to

$$L_{MSE} \simeq \frac{1}{2} \left[ 2 w^{(1)} w_{NI} \left( w^{(1)} w^{(0)} - M \right) \sigma_x \sigma_\epsilon \frac{\Phi}{\sqrt{|B|}} + \left( w^{(1)} w^{(0)} - M \right)^2 \sigma_x^2 + (w^{(1)})^2 w_{NI}^2 \sigma_\epsilon^2 \right], \tag{15}$$

where $\Phi$ is a random variable with zero mean and unit variance (footnote 7). From Eq. (15), we can easily see that the optimal solution is achieved for the weights $w^{(1)}_* w^{(0)}_* = M$, $w_{NI,*} = 0$. We can clearly read $R_{1,2}$ off Eq. (4) by separating them from the unperturbed loss function

$$L_{MSE}(w_{NI}, \theta) \simeq L_{MSE}(\theta) + R_1(w_{NI}, \theta) + R_2(w_{NI}, \theta), \tag{16}$$

while $R_k$ vanish for $k > 2$. The various terms in Eq. (16) are given by

$$L_{MSE}(\theta) = \frac{1}{2} \left( w^{(1)} w^{(0)} - M \right)^2 \sigma_x^2,$$
$$R_1(w_{NI}, \theta) = w_{NI} \langle \epsilon g_{\ell_{NI}} \rangle = w_{NI} w^{(1)} \left( w^{(1)} w^{(0)} - M \right) \sigma_x \sigma_\epsilon \frac{\Phi}{\sqrt{|B|}},$$
$$R_2(w_{NI}, \theta) = \frac{1}{2} \sigma_\epsilon^2 w_{NI}^2 H_{\ell_{NI}} = \frac{1}{2} (w^{(1)})^2 w_{NI}^2 \sigma_\epsilon^2,$$

where we have identified the local gradient term which generates a constrained random walk for $w_{NI}$, which decreases as the network approaches its data driven minimum. We also note that in the limit of an infinite batch size $|B| \to \infty$ it vanishes, leaving only an effective regularization term for the hidden layer weight, namely $L_2 = \lambda (w^{(1)})^2$, where the Lagrange multiplier $\lambda = \frac{1}{2} w_{NI}^2 \sigma_\epsilon^2$ decreases with time as the noise weight $w_{NI}$ is pushed to 0. We may glean further insights from this linear example by studying its training dynamics for small and large noise variances.
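The $1/\sqrt{|B|}$ suppression of the first-order term can be checked numerically. The sketch below is illustrative only: it assumes independent zero-mean Gaussian inputs and noise, and the function names (`batch_cross_term`, `cross_term_std`) are hypothetical helpers, not part of the authors' code. It measures the spread of the batch average of $x\epsilon$ and compares it with $\sigma_x \sigma_\epsilon / \sqrt{|B|}$.

```python
import math
import random


def batch_cross_term(batch_size, sigma_x, sigma_eps, rng):
    """Batch average of x * eps for independent zero-mean Gaussian x and eps."""
    total = 0.0
    for _ in range(batch_size):
        total += rng.gauss(0.0, sigma_x) * rng.gauss(0.0, sigma_eps)
    return total / batch_size


def cross_term_std(batch_size, sigma_x, sigma_eps, n_batches=2000, seed=0):
    """Empirical standard deviation of the batch-averaged cross term."""
    rng = random.Random(seed)
    samples = [
        batch_cross_term(batch_size, sigma_x, sigma_eps, rng)
        for _ in range(n_batches)
    ]
    mean = sum(samples) / n_batches
    var = sum((s - mean) ** 2 for s in samples) / (n_batches - 1)
    return math.sqrt(var)


if __name__ == "__main__":
    sigma_x, sigma_eps = 1.0, 2.0
    for b in (16, 64, 256):
        measured = cross_term_std(b, sigma_x, sigma_eps)
        predicted = sigma_x * sigma_eps / math.sqrt(b)
        print(f"|B|={b:4d}  measured={measured:.4f}  predicted={predicted:.4f}")
```

Under these assumptions the measured spread should track the prediction closely, which is the sense in which the $R_1$ term acts as a random kick that fades with batch size while $R_2$ does not.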
Assuming full batch gradient descent in the infinite sample limit, we neglect the local gradient contribution and focus on the coupled equations for the hidden layer weight and the noise weight, given respectively by

$$w^{(1)}_{t+1} = w^{(1)}_t \left( 1 - \eta \sigma_\epsilon^2 w_{NI,t}^2 \right) - \eta \left( w^{(1)}_t w^{(0)}_t - M \right) w^{(0)}_t \sigma_x^2, \qquad w_{NI,t+1} = w_{NI,t} \left( 1 - \eta \sigma_\epsilon^2 (w^{(1)}_t)^2 \right).$$

Assuming $M \neq 0$, without loss of generality, we may set $M = 1$ as the equations remain invariant under reparameterization (footnote 8). In the limit of $\sigma_\epsilon^2 \ll 1/(\eta w_{NI,0}^2)$, the equations decouple, with the data weight following the standard GD equation without noise, i.e., $w^{(1)}_{t+1} = w^{(1)}_t - \eta (w^{(1)}_t w^{(0)}_t - 1) w^{(0)}_t \sigma_x^2$, while the noise weight decays exponentially as long as $0 < |w^{(1)}_t|$ and $(w^{(1)}_0)^2 > \eta \sigma_\epsilon^2 / 2$. Clearly, the smaller $\sigma_\epsilon$ is, the smaller the regularizing effect of the noise on the local Hessian, given by the square of the hidden layer weights in this simple model. We expect that in this regime, the limit of continuous time GD should reproduce the correct dynamics as $\eta \sigma_\epsilon^2 \to 0$, yielding a differential equation for the noise weight

$$\dot{w}_{NI}(t) = -\sigma_\epsilon^2 (w^{(1)}(t))^2 w_{NI}(t),$$

where $\dot{x} = dx/dt$ is the continuous time derivative. The noise weight can therefore only decay.

Conversely, taking the large noise variance limit, we find that the dynamics are ignorant of the original learning objective, as the resulting equations become simply coupled

$$w^{(1)}_{t+1} = w^{(1)}_t \left( 1 - \eta \sigma_\epsilon^2 w_{NI,t}^2 \right), \qquad w_{NI,t+1} = w_{NI,t} \left( 1 - \eta \sigma_\epsilon^2 (w^{(1)}_t)^2 \right). \tag{20}$$

These equations describe a NN trained using completely random data with no labels or learning objective, with an effective loss given by the last term in Eq. (15). In this case, we expect the continuous time limit to fail as a complete description of the possible dynamics, as $\eta \sigma_\epsilon^2$ may be large.
We may demonstrate this failure by taking the continuous time limit, obtaining

$$\dot{w}_{NI}(t) = -\sigma_\epsilon^2 (w^{(1)}(t))^2 w_{NI}(t), \qquad \dot{w}^{(1)}(t) = -\sigma_\epsilon^2 (w_{NI}(t))^2 w^{(1)}(t),$$

implying both weights decrease in magnitude. This means that the network will not diverge, even for arbitrarily large $\sigma_\epsilon$. However, this is clearly not the case for the discrete Eq. (20), which may become stiff for sufficiently large noise variance. This numerical artifact entirely changes the weight behavior, opening up the possibility for the system to either diverge or catapult, as discussed in Levi et al. (2022). To summarize, this simple example provides a useful test case for our main analytical derivations appearing in the main text, displaying all the expected features of NINR in a fully calculable setting.
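The contrast between the small- and large-noise regimes of the discrete updates can be reproduced with a few lines of code. The following is a minimal sketch, not the authors' implementation: it iterates the coupled update equations for $w^{(1)}$ and $w_{NI}$ in the infinite-batch limit, and additionally updates $w^{(0)}$ by plain gradient descent on the same averaged loss (that third update is our assumption, derived from the averaged MSE loss rather than quoted from the text). The function names, initial values, and the divergence `cutoff` are hypothetical choices for illustration.

```python
def gd_step(w1, w0, w_ni, eta, sigma_eps, sigma_x=1.0, M=1.0):
    """One full-batch GD step on the infinite-batch toy loss
    0.5 * [(w1*w0 - M)^2 * sigma_x^2 + w1^2 * w_ni^2 * sigma_eps^2]."""
    err = w1 * w0 - M
    noise = eta * sigma_eps * sigma_eps
    new_w1 = w1 * (1.0 - noise * w_ni * w_ni) - eta * err * w0 * sigma_x * sigma_x
    # Assumption: the input weight follows ordinary GD on the same loss.
    new_w0 = w0 - eta * err * w1 * sigma_x * sigma_x
    new_w_ni = w_ni * (1.0 - noise * w1 * w1)
    return new_w1, new_w0, new_w_ni


def simulate(eta, sigma_eps, steps=2000, w1=0.5, w0=0.5, w_ni=1.0, cutoff=1e6):
    """Iterate the updates; stop early and flag divergence past `cutoff`."""
    history = [(w1, w0, w_ni)]
    for _ in range(steps):
        w1, w0, w_ni = gd_step(w1, w0, w_ni, eta, sigma_eps)
        history.append((w1, w0, w_ni))
        if max(abs(w1), abs(w0), abs(w_ni)) > cutoff:
            return history, True
    return history, False


if __name__ == "__main__":
    for sigma_eps in (0.5, 10.0):
        history, diverged = simulate(eta=0.1, sigma_eps=sigma_eps)
        w1, w0, w_ni = history[-1]
        print(f"sigma_eps={sigma_eps}: steps={len(history) - 1}, "
              f"diverged={diverged}, w1*w0={w1 * w0:.4g}, w_NI={w_ni:.4g}")
```

With these illustrative settings, the small-noise run lets $w_{NI}$ shrink steadily while $w^{(1)} w^{(0)}$ approaches $M$, whereas the large-noise run leaves the stable region within a few steps. Intermediate values of $\eta \sigma_\epsilon^2$ are where catapult-like behavior would be looked for, which this sketch does not attempt to characterize.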

D ADDITIONAL EXPERIMENTS

Throughout this section, we train the FC and the CNN using the same specifications as given in Fig. 2, unless otherwise specified. Training runs for 500 epochs or until the network reaches 98% training accuracy, whichever comes first. This is done to demonstrate that NINR with a large amount of noise injection requires a longer period of training, and otherwise suffers from degraded generalization performance, as discussed in the main text.

D.1 ADVERSARIAL ATTACKS

In addition to input perturbations caused by deployment issues, natural degradation, and unexpected noise sources, targeted perturbations, meant to maximally impair the performance of a network while changing the data as little as possible, form a conceptually different concern. Quantifying what corresponds to a minimal distortion of the data is a domain-specific and somewhat subjective task. Nevertheless, standard approaches exist. One of the simplest known implementations of an adversarial attack is the white-box untargeted Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014), which transforms inputs according to

$$x \to x + \delta_{FGSM} \times \mathrm{sign}\left( \nabla_x L(\theta; x, y) \right), \tag{22}$$

where $\delta_{FGSM}$ is a small positive parameter that controls the size of the perturbation. We also consider the Projected Gradient Descent (PGD) attack (Kurakin et al., 2016; Madry et al., 2017), which iterates the FGSM attack $k$ times, compounding its effect.

In Fig. 6 we compare the performance of standard regularization schemes with NINR against FGSM and PGD type adversarial attacks. We find that NINR displays superior performance over $L_2$ and un-regularized nets. For FGSM attacks, dropout performs best among the options tested, while for PGD attacks NINR outperforms the alternatives. These preliminary results suggest potential improvement against certain types of adversarial attacks when NINR is used. Further analysis is required to determine whether combining NINR with other regularization schemes, or changing the noise distribution during training, could produce a more successful scheme.
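To make the two attacks concrete, the sketch below applies them to a tiny logistic-regression model, chosen only because its input gradient is available in closed form; it is not the network or code used in the experiments. The PGD variant takes `k` FGSM-style steps of size `step` and, as an assumption following the usual formulation, clips the result back into an $L_\infty$ ball of radius `delta` around the clean input. All names and parameter values are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    tiny = 1e-12  # guards the logarithms against p = 0 or p = 1
    loss = -(y * math.log(p + tiny) + (1 - y) * math.log(1.0 - p + tiny))
    grad = [(p - y) * wi for wi in w]  # d(loss)/dx for this model
    return loss, grad


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, delta):
    """Single FGSM step: x -> x + delta * sign(grad_x loss)."""
    _, grad = loss_and_input_grad(w, b, x, y)
    return [xi + delta * sign(gi) for xi, gi in zip(x, grad)]


def pgd(w, b, x, y, delta, step, k):
    """k FGSM steps of size `step`, each projected onto the L_inf ball of radius delta."""
    x_adv = list(x)
    for _ in range(k):
        x_adv = fgsm(w, b, x_adv, y, step)
        x_adv = [min(max(xa, xi - delta), xi + delta) for xa, xi in zip(x_adv, x)]
    return x_adv


if __name__ == "__main__":
    w, b = [2.0, -1.0, 0.5], 0.1
    x, y = [0.3, 0.2, -0.4], 1
    clean, _ = loss_and_input_grad(w, b, x, y)
    attacked, _ = loss_and_input_grad(w, b, fgsm(w, b, x, y, 0.1), y)
    iterated, _ = loss_and_input_grad(w, b, pgd(w, b, x, y, 0.1, 0.05, 5), y)
    print(f"clean={clean:.4f}  fgsm={attacked:.4f}  pgd={iterated:.4f}")
```

For a deep network the input gradient would come from automatic differentiation instead of the closed form used here, but the update rule is the same.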

D.2 DIFFERENT NOISE DISTRIBUTIONS

Here, we examine the effects of sampling the NINs from different noise distributions on the performance of NINR. For each different noise distribution, we repeat the tests used to produce Fig. 2, demonstrating robustness against corrupted inputs. We compare results using a uniform distribution $\epsilon \sim U(-\sigma_\epsilon, \sigma_\epsilon)$ and an asymmetric (double Gaussian peaked at $\pm\sigma_\epsilon$) distribution, for fcNINR and cNINR using DNNs trained on the FMNIST dataset. In Fig. 7, we see that varying the noise distribution has a minimal effect on NINR as a regularization scheme, aside from the asymmetric distribution for the catapult phase. We attribute this behavior to an extreme choice of noise injection scale, where a much longer training time is required to obtain good performance for NINR.
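A sampler for the three noise choices might look like the sketch below. The text does not specify the width of the two Gaussian peaks or their relative weight, so `peak_width` (as a fraction of $\sigma_\epsilon$) and `p_plus` (the probability of drawing from the peak at $+\sigma_\epsilon$) are hypothetical parameters introduced here for illustration, as is the function name.

```python
import random


def sample_noise(kind, sigma_eps, rng, peak_width=0.1, p_plus=0.5):
    """Draw one noise value for the injection node.

    kind: "gaussian", "uniform", or "double_gaussian".
    peak_width and p_plus are illustrative assumptions, not values from the paper.
    """
    if kind == "gaussian":
        return rng.gauss(0.0, sigma_eps)
    if kind == "uniform":
        return rng.uniform(-sigma_eps, sigma_eps)
    if kind == "double_gaussian":
        center = sigma_eps if rng.random() < p_plus else -sigma_eps
        return rng.gauss(center, peak_width * sigma_eps)
    raise ValueError(f"unknown noise kind: {kind}")


if __name__ == "__main__":
    rng = random.Random(0)
    for kind in ("gaussian", "uniform", "double_gaussian"):
        draws = [sample_noise(kind, 2.0, rng) for _ in range(5)]
        print(kind, [round(d, 3) for d in draws])
```

Setting `p_plus` away from 0.5 is one way to make the two-peak distribution asymmetric; whether that matches the construction used for Fig. 7 is not stated in the text.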

D.3 DIFFERENT OPTIMIZERS

Here, we examine the effects of changing the optimization algorithm, beyond SGD, on the performance of NINR. For each different optimizer, we repeat the tests used to produce Fig. 2, demonstrating robustness against corrupted inputs. We compare results using RMSprop (Hinton, 2012) and Adam (Kingma & Ba, 2014), for fcNINR and cNINR using DNNs trained on the FMNIST dataset. Here, we use different parameters for the different architectures and optimizers. Namely, for RMSprop we take $\rho = 0.9$, $\epsilon = 10^{-7}$, and $\eta = 0.0001$ for FC and $\eta = 0.001$ for CNN. For Adam we take $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-7}$, and $\eta = 0.01$ for both FC and CNN, with noise injection magnitudes given in App. D.3.

The network used to test NINR performance is constructed by connecting the following blocks (footnote 9):
• Conv2D(32,3,3) → ReLU → Batch Norm → Conv2D(32,3,3) → ReLU → Batch Norm → MaxPool(2,2) → Dropout($p_{drop}$ = 0.2).
• Conv2D(64,3,3) → ReLU → Batch Norm → Conv2D(64,3,3) → ReLU → Batch Norm → MaxPool(2,2) → Dropout($p_{drop}$ = 0.3).
• Conv2D(128,3,3) → ReLU → Batch Norm → Conv2D(128,3,3) → ReLU → Batch Norm → MaxPool(2,2) → Dropout($p_{drop}$ = 0.4).
• Dense ReLU Layer(500) → Linear Layer(10).

Optimization is done using SGD without momentum with the learning rate fixed to $\eta = 0.05$ and mini-batch size $B = 128$. Each training run is performed for 500 SGD training epochs in total, or until 98% training accuracy has been achieved. We provide preliminary results for robustness against input-data corruption in Fig. 9. In contrast to the previous sections, the CNN used to train on CIFAR-10 contains Dropout and $L_2$ as part of its architecture, making comparison between NINR and the two redundant. Therefore, we show results for the same network with and without NINR, as well as CDT with different input corruption scales. The success of NINR is retained for in-NINR in the decay phase, while the catapult phase requires longer than 500 epochs to obtain similar generalization properties.

F RESULTS FOR MNIST-C

Here, we show some additional results for the same architectures and NINR parameters used in Sec. 3, trained on the MNIST dataset and tested on several classes of images from MNIST-C. The MNIST-C dataset (Mu & Gilmer, 2019) consists of 15 types of corruption applied to the MNIST test set, for benchmarking out-of-distribution robustness in computer vision. For testing purposes, the data is preprocessed similarly to the FMNIST dataset. The results shown in Table 4 indicate improved performance when the type of image corruption applied to the MNIST images most closely resembles the injected noise. It can therefore be intuitively understood why the most dramatic performance enhancement is found for the Impulse Noise corruption transformation, while other corruption transformations may not benefit much from NINR. We stress that NINR can be readily modified to deal with different types of corruption by changing the noise injection distribution, and can be combined with other regularization methods to compound their robustness-enhancing effects.

G CONSTANT NOISE INJECTION

Here, we reproduce the results shown in Fig. 2, including an additional curve representing a constant input noise injection. We implement this experiment by applying dNINR at the input layer (connecting to the first hidden layer) using the large NIN variance value used for the "catapult" phase, but keeping the NIWs static, fixed to their initialization values.



Footnote 0: One may also generate $\epsilon$ only once, before training. We empirically find no difference between the two options, which is expected from large batch averaging. For $|B| \lesssim 10$, differences begin to emerge.

Footnote 1: In practice, piecewise analytic activation functions such as ReLU are often used, and if the noise causes the crossing of a non-analytic point, the above expansion receives corrections. Empirically, we find this subtlety to not change any of our qualitative conclusions.

Footnote 2: We note that the noisy loss function Eq. (1) is invariant under the simultaneous rescaling of $w_{NI} \to \lambda w_{NI}$ and $\epsilon \to \lambda^{-1} \epsilon$. Nonetheless, the SGD optimization equations are not invariant under this transformation, implying, in particular, that the value of the injected noise variance, $\sigma_\epsilon^2$, is a relevant parameter, not degenerate with the initialization values of the NIW. As a consequence, in order to fully explore the parameter space of noise injection, the noise (or more precisely, its variance) cannot be assumed to be small, and large noise injection values must be considered.

Footnote 3: In fact, it is shown in Levi et al. (2022) that all odd terms in the expansion Eq. (2) are suppressed by the square root of the batch size, while the even terms are not.

Footnote 4: We comment on the slight subtlety of biases in Eq. (6). In any reasonable scenario, biases would not be corrupted, but if bias is treated as the zeroth component of $x$, the zeroth component of $\delta$ should be $\equiv 0$. Taking this into account, the trace operation of Eq. (6) should not sum over the zeroth dimension.

Footnote 5: An analogous phase related to the size of the training step was discussed in Lewkowycz et al. (2020).

Footnote 6: As our procedure for cNINR preserves the structure of the original $x^{(\ell)}$, it can be easily applied for other architectures beyond convolutional layers, including densely connected layers.

Footnote 7: Additional $O(\sigma_\epsilon^2 / \sqrt{|B|})$ corrections coming from stochastic variations in the $\sigma_\epsilon^2$ term emerge from batch-averaging but are neglected.

Footnote 8: Taking $w^{(1)} \to M w^{(1)}$, $w_{NI} \to M w_{NI}$ and $\sigma_\epsilon \to \sigma_\epsilon / M$ leaves the equations invariant.

Footnote 9: Each convolutional layer admits $L_2$ weight decay regularization ($\lambda_{WD} = 10^{-4}$).



Figure 1: Illustration of perturbations to data inputs with respect to the joint probability distribution manifold of features and labels. Points indicate {sample, label} pairs $\{x, y\}$, where different colored points correspond to samples drawn from different marginal distributions. Black points represent pairs from a training dataset $\{x_i, y_i\}_{i=1}^{N}$, with the red spheres indicating corrupted inputs, determined by shifted distribution functions $f_{\rm corrupted}(x + \epsilon, y)$. The gray arrow represents an adversarial attack, performed by ascending up the gradient of the network output to reach the closest decision boundary, while generalization from training to test data is depicted as interpolation from black to blue points. Finally, domain shift is a shift in the underlying distribution on the same manifold, depicted by the green arrow and points.

Figure 2: Robustness against random input perturbations tested on FC network (left) and CNN (right). Test accuracy vs. the scale of input noise corruption defined in Eq. (7) is shown for L2, Dropout, in-NINR, full-NINR and CDT with σnoise = 0.4. Shades indicate 2 standard deviations estimated over 10 distinct runs. For the input-NINR (full-NINR) fully-connected implementations we take σϵ = 51.8 (16.4) in the decay phase and σϵ = 231.6 (51.8) in the catapult phase. Similarly, for the convolutional implementations we take σϵ = 2.8 (0.9) in the decay phase and σϵ = 87.5 (62) in the catapult phase. This key result illustrates that NINR significantly increases the robustness of generic architectures trained on the FMNIST dataset, while marginally affecting generalization (σnoise = 0). Comparing the CDT and input-NINR curves demonstrates the advantage of our regularization method. While both techniques perform similarly well on data corruption of σnoise = 0.4, CDT is significantly worse on clean data. This is a result of the CDT network being forced to fit both noise and data, without the ability to suppress the noise, a crucial attribute of NINR. Here, the learning rate is fixed to η = 0.05 with mini-batch size B = 128. Each training run is performed for 500 SGD training epochs in total, or until 98% training accuracy has been achieved. For further details, see Sec. 3 and App. A.

Figure 3: NIW dynamics during training for the various phases discussed in Sec. 2.2, for a single hidden layer FC network with ReLU activations trained on the full FMNIST dataset, as specified in App. A. Here, we show the evolution of the NIWs norm (blue) as well as the hidden layer weights norm |W (ℓ NI +1) | (red) against the training (violet) and test (green) loss (solid) and accuracy (dashed). Left to right: The NIN magnitude determines the phase of the system, ranging from the smallest amount in the decoupled phase, to an overwhelming amount in the divergent phase. The behavior displayed by the NIWs, as well as the loss corroborates the predictions discussed in Sec. 2.2 and App. C, with experimental details in App. A.

Figure 4: Left: An illustration of a fully connected NINR (fcNINR) for which a Noise Injection Node is appended to a representation $x^{(\ell)}$. Right: Implementation of NINR for a convolutional network (cNINR), where the Noise Injection Node is connected pixel-wise to the image representation $x^{(\ell)}$, to be subsequently fed into a convolutional layer.

size B = 128. Each training run is performed for 500 SGD training epochs in total, or until 98% training accuracy has been achieved, unless otherwise specified. All test accuracy evaluations are done with the NIN output set to 0, i.e., $\epsilon = 0$. The model parameters $\theta$, $W_{NI}$ are initialized at iteration $t = 0$ using normal distributions with variances $\sigma^2_{\theta_0} = 1/d_\ell$ and $\sigma^2_{W_{NI},0} = 1/d_{\ell_{NI}}$, respectively. The hyperparameters are chosen to match reference implementations: the $L_2$ regularization coefficient (weight decay) is set to $\lambda_{WD} = 5 \cdot 10^{-4}$ and the dropout rate is set to $p_{drop} = 0.5$. When using $L_2$ or Dropout, they are applied at/after each layer. When using input CDT as a regularization method, we corrupt the input data according to Eq. (7) below. We further stress that once a NIN has been added to the network, no further modifications to the training algorithm or architecture are required, and after choosing where to connect the NIN, the only free parameter is the injected noise variance $\sigma_\epsilon^2$.

Fig. 2(a). Fully connected, three hidden layers with width $d_{1,2,3} = 1024$, weight initialization $W^{(\ell)} \sim \mathcal{N}(0, 1/d_\ell)$, $b = 0$. ReLU activation, trained using SGD (no momentum) on FMNIST, with a learning rate of $\eta = 0.05$ and batch size $|B| = 128$.

Fig. 2(b). CNN composed of 2 convolutional blocks, followed by a dense ReLU layer with width $d_3 = 2048$ and a dense connection to the prediction layer, weight initialization $W^{(\ell)} \sim \mathcal{N}(0, 1/d_\ell)$, $b = 0$. Each convolutional block is taken as Conv2D(2,2) → ELU → Batch-Norm → MaxPool(2,2), trained using SGD (no momentum) on FMNIST, with a learning rate of $\eta = 0.05$ and $|B| = 128$.

Fig. 3. Fully connected, one hidden layer with $d_1 = 1024$, weight initialization $W^{(\ell)} \sim \mathcal{N}(0, 1/d_\ell)$, $b = 0$. ReLU activation, trained using SGD (no momentum) on FMNIST, with learning rate $\eta = 0.01$ and $|B| = 1000$. From left to right, the injected noise is $\sigma_\epsilon^2 = \{0, 0.1 \cdot d_{in}/\eta, d_{in}/\eta, 1.8 \cdot d_{in}/\eta\}$, corresponding to the decoupled, decay, catapult, and divergent phases, respectively. Here, $d_{in} = 785$ is the dimension of the input data including a single NIN.

Figure 5: Illustration of the univariate linear DNN with a single input scalar, noise node, and one hidden layer.

Figure 6: Robustness against adversarial attacks. Left: FC network, Right: CNN (detailed specifications are in App. A). This key result illustrates that NINR significantly increases the robustness of generic model architectures trained on the FMNIST dataset. Shades indicate 2 standard deviations estimated over 10 distinct runs.

Figure 7: Robustness against random input perturbations for FC (top row) and convolutional (bottom row) networks using NINR training with different noise distributions (detailed specifications are in App. A). Left: Asymmetric double Gaussian distribution peaked at ±σϵ, Right: Uniform distribution ϵ ∼ U(-σϵ, σϵ). The fully-connected and CNN NINR noise magnitudes are those of Fig. 2. Shades indicate 2 standard deviations estimated over 5 distinct runs.

Figure 8: Robustness against random input perturbations for FCs (top row) and CNNs (bottom row) using NINR training under different optimization schemes (detailed specifications are in App. A). Left: Adam, trained with η = 0.01 for both FC and CNN, Right: RMSprop, trained with η = 0.0001 for FC and η = 0.001 for CNN. The fully-connected and CNN NINR noise magnitudes are those of App. D.3. Shades indicate 2 standard deviations estimated over 5 distinct runs.

Figure 9: Robustness against random input perturbations for the network described in App. E using NINR training on CIFAR10. Here, we use the in-NINR CNN implementation, taking σϵ = 17.5 in the decay phase and σϵ = 55.4 in the catapult phase.

Figure 10: Robustness against random input perturbations with the same parameters used in Fig. 2. The additional orange curve represents In-NINR with σϵ = 231.6 but with fixed NIW values, which is essentially constant noise injection to the pre-activation at the first hidden layer.

A test of domain shift, trained on MNIST up to 100% training accuracy and evaluated on the USPS test dataset, comparing L2, Dropout, in-NINR, and full-NINR as discussed in the text. We exclude CDT from this table, as it is not as prevalent as the other regularization schemes, and is tailored for random noise. The fully-connected and CNN NINR noise magnitudes are those of Fig.2. Errors indicate 2σ confidence intervals over 10 distinct runs for full training. We find the test accuracy on the full clean MNIST dataset to be similar across all regularization schemes, with ∼ 97% and ∼ 98% for FC and CNN respectively.

Generalization on clean test data, evaluated on the FMNIST dataset. Comparison is made between L2, Dropout, in-NINR, full-NINR and CDT, and we highlight regularization methods which worsen generalization by italicizing. For CDT, two values (σnoise = 0.2, 0.4) for the amount of corruption are considered. The fully-connected and CNN NINR noise parameters are those used in Fig. 2. Errors indicate 2σ confidence intervals over 10 distinct runs. The NINR implementations, except perhaps the in-NINR at the catapult phase, have generalization performance comparable to the rest of the regularization schemes, aside from CDT, where performance is diminished as the network learns the noisy distribution rather than the original one.

Works by Rakin et al. (2018) and Xiao et al. (2021) follow similar reasoning, though both are limited, by construction, to a small amount of noise injection, and are more empirically driven. Rakin et al. (2018) and Rusak et al. (2020) also feature complex custom update steps which are less adaptable to other architectures.

Amount of noise injection (σϵ) for the different architectures using RMSprop and Adam.

, we implement cNINR, working with a CNN based on VGG style blocks described in Simonyan & Zisserman (2014). The CIFAR-10 dataset consists of color images of objects divided into 10 categories, with 32 × 32 pixels in 3 color channels, each pixel intensity in the range [0, 1], partitioned into 50 000 training and 10 000 test samples, which are then preprocessed similarly to the FMNIST dataset.

Domain shift performance on the MNIST-C test data, for networks trained on the MNIST dataset. Comparison is made between L2, Dropout, in-NINR, full-NINR and CDT. For CDT, two values (σnoise = 0.2, 0.4) for the amount of corruption are considered. The fully-connected and CNN NINR noise parameters are those used in Fig. 2. The NINR implementations improve performance for data transformations which are most closely related to Gaussian noise injection, as can be expected.

6. ACKNOWLEDGEMENTS

We thank Yasaman Bahri, Kyle Cranmer, Guy Gur-Ari, and Sho Yaida for useful discussions and comments. NL would like to thank the Milner Foundation for the award of a Milner Fellowship. MF is supported by the DOE under grant DE-SC0010008 and the NSF under grant PHY1316222. MF would like to thank Tel Aviv University, the Aspen Center for Physics (supported by the U.S. National Science Foundation grant PHY-1607611), and the Galileo Galilei Institute for their hospitality while this work was in progress. The work of TV is supported by the Israel Science Foundation (grant No. 1862/21), by the Binational Science Foundation (grant No. 2020220) and by the European Research Council (ERC) under the EU Horizon 2020 Programme (ERC-CoG-2015 -Proposal n. 682676 LDMThExp).

REPRODUCIBILITY STATEMENT

In Sec. 2, we state our theoretical results, ensuring that we state our assumptions and the limitations of the approximations we make at every step. In several instances, we rely on proofs given in other works, as well as supplement our analyses in App. C. The models and tools used for analysis in our experiments are provided in the following anonymous link: https://anonymous.4open.science/r/NoiseInjectionNodeCode-2A68, while explicit details regarding our experimental setup, as well as a complete description of the data processing steps for the datasets we used, are given in Sec. 3 and App. A.

