SIMPLE INITIALIZATION AND PARAMETRIZATION OF SINUSOIDAL NETWORKS VIA THEIR KERNEL BANDWIDTH

Abstract

Neural networks with sinusoidal activations have been proposed as an alternative to networks with traditional activation functions. Despite their promise, particularly for learning implicit models, their training behavior is not yet fully understood, leading to a number of empirical design choices that are not well justified. In this work, we first propose a simplified version of such sinusoidal neural networks, which allows both for easier practical implementation and simpler theoretical analysis. We then analyze the behavior of these networks from the neural tangent kernel perspective and demonstrate that their kernel approximates a low-pass filter with an adjustable bandwidth. Finally, we utilize these insights to inform the sinusoidal network initialization, optimizing their performance for each of a series of tasks, including learning implicit models and solving differential equations.

Analysis of the initialization scheme. The initialization scheme proposed above differs from the one implemented in SIRENs. We will now show that this particular choice of initialization distribution preserves the variance of the originally proposed SIREN initialization distribution. As a consequence, the original theoretical justifications for its initialization scheme still hold under this activation, namely that the distributions of activations across layers are stable, well-behaved and shift-invariant. Due to space constraints, proofs are presented in Appendix A. Moreover, we also demonstrate empirically that these properties are maintained in practice.

Lemma 1. Given any c, for X ∼ N(0, c²/3) and Y ∼ U(−c, c), we have Var[X] = Var[Y] = c²/3.

1. INTRODUCTION

Sinusoidal networks are neural networks with sine nonlinearities, instead of the traditional ReLU or hyperbolic tangent. They have been recently popularized, particularly for applications in implicit representation models, in the form of SIRENs (Sitzmann et al., 2020). However, despite their popularity, many aspects of their behavior and comparative advantages are not yet fully understood. Particularly, some initialization and parametrization choices for sinusoidal networks are often defined arbitrarily, without a clear understanding of how to optimize these settings in order to maximize performance. In this paper, we first propose a simplified version of such sinusoidal networks, which allows for easier implementation and theoretical analysis. We show that these simple sinusoidal networks can match and outperform SIRENs in implicit representation learning tasks, such as fitting videos, images and audio signals. We then analyze sinusoidal networks from a neural tangent kernel (NTK) perspective (Jacot et al., 2018), demonstrating that their NTK approximates a low-pass filter with adjustable bandwidth. We confirm, through an empirical analysis, that this theoretically predicted behavior also holds approximately in practice. We then use the insights from this analysis to inform the choices of initialization and parameters for sinusoidal networks. We demonstrate that we can optimize the performance of a sinusoidal network by tuning the bandwidth of its kernel to the maximum frequency present in the input signal being learned. Finally, we apply these insights in practice, demonstrating that "well tuned" sinusoidal networks outperform other networks in learning implicit representation models with good interpolation outside the training points, and in learning the solution to differential equations.

2. BACKGROUND AND RELATED WORK

Sinusoidal networks. Sinusoidal networks have been recently popularized for implicit modelling tasks by sinusoidal representation networks (SIRENs) (Sitzmann et al., 2020). They have also been evaluated for physics-informed learning, demonstrating promising results in a series of domains (Raissi et al., 2019b; Song et al., 2021; Huang et al., 2021b; a; Wong et al., 2022). Among the benefits of such networks is the fact that the mapping of inputs through an (initially) random linear layer followed by a sine function is mathematically equivalent to a transformation to a random Fourier basis, rendering them close to networks with Fourier feature transforms (Tancik et al., 2020; Rahimi & Recht, 2007), and possibly able to address spectral bias (Basri et al., 2019; Rahaman et al., 2019; Wang et al., 2021). Sinusoidal networks also have the property that the derivative of their outputs is given simply by another sinusoidal network, due to the fact that the derivative of the sine function is a phase-shifted sine. Neural tangent kernel. An important prior result to the neural tangent kernel (NTK) is the neural network Gaussian process (NNGP). At random initialization of its parameters θ, the output function of a neural network of depth L with nonlinearity σ converges to a Gaussian process, called the NNGP, as the width of its layers n₁, ..., n_L → ∞ (Neal, 1994; Lee et al., 2018). This result, though interesting, does not say much on its own about the behavior of trained neural networks. This role is left to the NTK, which is defined as the kernel given by Θ(x, x̃) = ⟨∇_θ f_θ(x), ∇_θ f_θ(x̃)⟩. It can be shown that this kernel can be written out as a recursive expression involving the NNGP. Importantly, Jacot et al. (2018) demonstrated that, again as the network layer widths n₁, ..., n_L → ∞, the NTK is (1) deterministic at initialization and (2) constant throughout training.
Finally, it has also been demonstrated that under some assumptions on its parametrization, the output function of the trained neural network f_θ converges to the kernel regression solution using the NTK (Lee et al., 2020; Arora et al., 2019). In other words, under certain assumptions the behavior of a trained deep neural network can be modeled as kernel regression using the NTK. Physics-informed neural networks. Physics-informed neural networks (Raissi et al., 2019a) are a method for approximating the solution to differential equations using neural networks (NNs). In this method, a neural network û(t, x; θ), with learned parameters θ, is trained to approximate the actual solution function u(t, x) to a given partial differential equation (PDE). Importantly, PINNs employ not only a standard "supervised" data loss, but also a physics-informed loss, which consists of the differential equation residual N. Thus, the training loss consists of a linear combination of two loss terms, one directly supervised from data and one informed by the underlying differential equations.

3. SIMPLE SINUSOIDAL NETWORKS

There are many details that complicate the practical implementation of current sinusoidal networks. We aim to propose a simplified version of such networks in order to facilitate theoretical analysis and practical implementation, by removing such complications. As an example we can look at SIRENs, which have their layer activations defined as f_l(x) = sin(ω(W_l x + b_l)). In order to cancel the ω factor, layers after the first one have their weights initialized following a uniform distribution with range [−√(6/n)/ω, √(6/n)/ω], where n is the size of the layer. Unlike the other layers, the first layer is sampled from a uniform distribution with range [−1/n, 1/n]. We instead propose a simple sinusoidal network, with the goal of formulating an architecture that mainly amounts to substituting its activation functions by the sine function. We will, however, keep the ω parameter, since (as we will see in future analyses) it is in fact a useful tool for allowing the network to fit inputs of diverse frequencies. The layer activation equations of our simple sinusoidal network, with parameter ω, are defined as

f₁(x) = sin(ω(W₁x + b₁)),    f_l(x) = sin(W_l x + b_l), l > 1.

Finally, instead of utilizing a uniform initialization as in SIRENs (with different bounds for the first and subsequent layers), we propose initializing all parameters in our simple sinusoidal network using a default Kaiming (He) normal initialization scheme. This choice not only greatly simplifies the initialization scheme of the network, but it also facilitates theoretical analysis of the behavior of the network under the NTK framework, as we will see in Section 4. This simple lemma relates to Lemma 1.7 in Sitzmann et al. (2020), showing that the initialization we propose here has the same variance as the one proposed for SIRENs. Using this result we can translate the result from the main Theorem 1.8 from Sitzmann et al. (2020), which claims that the SIREN initialization indeed has the desired properties, to our proposed initialization:¹ For a uniform input in [−1, 1], the activations throughout a sinusoidal network are approximately standard normal distributed before each sine non-linearity and arcsine-distributed after each sine non-linearity, irrespective of the depth of the network, if the weights are distributed normally, with mean 0 and variance 2/n, where n is a layer's fan-in. Empirical evaluation of initialization scheme. To empirically demonstrate that the proposed simple initialization scheme preserves the properties of the SIREN initialization scheme, we perform the same analysis performed by Sitzmann et al. (2020). We observe that the distribution of activations matches the predicted normal (before the non-linearity) and arcsine (after the non-linearity) distributions, and that this behavior is stable across many layers. These results are reported in detail in Appendix B.
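To make the proposed architecture concrete, the following is a minimal NumPy sketch of a simple sinusoidal network: Kaiming (He) normal weights in every layer, sine activations, and ω applied only to the first layer's pre-activation. The function names are ours, and zero-initialized biases are a simplification of ours, not a detail specified by the text.

```python
import numpy as np

def init_ssn(layer_sizes, rng):
    """Kaiming (He) normal initialization: W ~ N(0, 2/fan_in) for every layer.
    Biases are zeroed here as a simplification."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def ssn_forward(params, x, omega):
    """Simple sinusoidal network: omega scales only the first pre-activation."""
    h = x
    for l, (W, b) in enumerate(params[:-1]):
        z = W @ h + b
        h = np.sin(omega * z) if l == 0 else np.sin(z)
    W, b = params[-1]
    return W @ h + b  # linear output layer

rng = np.random.default_rng(0)
params = init_ssn([2, 256, 256, 1], rng)
y = ssn_forward(params, np.array([0.5, -0.5]), omega=32.0)
```

Note how, compared to a SIREN, there is no per-layer ω factor and no layer-dependent uniform initialization bound to track.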

3.1. COMPARISON TO SIREN

In order to demonstrate that our simplified sinusoidal network has comparable performance to a standard SIREN, in this section we reproduce the main results from Sitzmann et al. (2020). Table 1 compiles the results for all experiments. In order to be fair, we compare the simplified sinusoidal network proposed in this chapter with both the results directly reported in Sitzmann et al. (2020) and our own reproduction of the SIREN results (using the same parameters and settings as the original). We can see from the numbers reported in the table that the performance of the simple sinusoidal network proposed in this chapter matches the performance of the SIREN in all cases, in fact surpassing it in most of the experiments. Qualitative results are presented in Appendix C. It is important to note that this is not a favorable setting for simple sinusoidal networks, given that the training durations were very short. The SIREN favors quickly converging to a solution, though it does not have as strong asymptotic behavior. This effect is likely due to the multiplicative factor applied to later layers described in Section 3. We observe that indeed in almost all cases we can compensate for this effect by simply increasing the learning rate in the Adam optimizer (Kingma & Ba, 2014). Finally, we observe that besides being able to surpass the performance of SIREN in most cases in a short training regime, the simple sinusoidal network performs even more strongly with longer training. To demonstrate this, we repeated some experiments from above, but with longer training durations. These results are shown in Table 4 in Appendix C.

4. NEURAL TANGENT KERNEL ANALYSIS

In the following we derive the NTK for sinusoidal networks. This analysis will show us that the sinusoidal network's NTK is approximately a low-pass filter, with its bandwidth directly defined by ω. We support these findings with an empirical analysis as well in the following section. Finally, we demonstrate how the insights from the NTK can be leveraged to properly "tune" sinusoidal networks to the spectrum of the desired signal. Full derivations and extensive, detailed analysis are left to Appendix D. The NTK for a simple sinusoidal network with a single hidden layer is presented in the theorem below. The NTKs for SIRENs with 1 and 6 hidden layers are shown in Figure 1.

Theorem 2 (Shallow SSN NTK). For a simple sinusoidal network with one hidden layer f^(1): ℝ^n₀ → ℝ^n₂ following Definition 1, its neural tangent kernel (NTK), as defined in Theorem 6, is given by

Θ^(1)(x, x̃) = (1/2)(ω² xᵀx̃ + 1 + 1) e^(−(ω²/2)‖x − x̃‖²₂) − (1/2)(ω² xᵀx̃ + 1 − 1) e^(−(ω²/2)‖x + x̃‖²₂) e^(−2ω²) + ω² xᵀx̃ + 1 + 1.

¹We note that despite being named Theorem 1.8 in Sitzmann et al. (2020), this result is not fully formal, due to the Gaussian distribution being approximated without a formal analysis of this approximation. Additionally, a CLT result is employed which assumes infinite width, which is not applicable in this context. We thus refrain from calling our equivalent result a theorem. Nevertheless, to the extent that the argument is applicable, it would still hold for our proposed initialization, due to its dependence solely on the variance demonstrated in Lemma 1 above.

We can see that for values of ω > 2, the second term quickly vanishes due to the e^(−2ω²) factor. This leaves us with only the first term, which has a Gaussian form. Due to the linear scaling term xᵀx̃, this is only approximately Gaussian, but the approximation improves as ω increases. We can thus observe that this kernel approximates a Gaussian kernel, which is a low-pass filter, with its bandwidth defined by ω.
Figure 1 presents visualizations of the NTKs for the simple sinusoidal network, compared to a (scaled) pure Gaussian with variance ω⁻², showing there is a close match between the two. If we write out the NTK for networks with more than one hidden layer, it quickly becomes uninterpretable due to the recursive nature of the NTK definition (see Appendix D). However, as shown empirically in Figure 1, these kernels are still approximated by Gaussians with variance ω⁻². We also observe that the NTK for a SIREN with a single hidden layer is analogous, but with a sinc form, which is also a low-pass filter.

Theorem 3 (Shallow SIREN NTK). For a single hidden layer SIREN f^(1): ℝ^n₀ → ℝ^n₂ following Definition 1, its neural tangent kernel (NTK), as defined in Theorem 6, is given by

Θ^(1)(x, x̃) = (c²/6)(ω² xᵀx̃ + 1 + 1) ∏_{j=1}^{n₀} sinc(cω(x_j − x̃_j)) − (c²/6)(ω² xᵀx̃ + 1 − 1) e^(−2ω²) ∏_{j=1}^{n₀} sinc(cω(x_j + x̃_j)) + ω² xᵀx̃ + 1 + 1.

For deeper SIREN networks, the kernels defined by the later layers are in fact Gaussian too, as discussed in Appendix D. This leads to an NTK that is approximated by a product of a sinc function and a Gaussian. These SIREN kernels are also presented in Figure 1.
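The low-pass shape of the shallow kernel can be checked numerically at finite width. The sketch below computes an empirical NTK for a one-hidden-layer sine network with scalar input and output, using hand-derived parameter gradients under an NTK-style parametrization consistent with Definition 1. The width, ω value, and evaluation points are our choices; the check is only qualitative (the kernel concentrates near x = x̃), not a verification of the exact constants in Theorem 2.

```python
import numpy as np

def shallow_ssn_ntk(x, xt, W0, b0, W1, omega):
    """Empirical NTK of f(x) = W1 . sin(omega*(W0*x + b0))/sqrt(n) + b1
    for scalar inputs, from hand-derived parameter gradients."""
    n = W0.shape[0]
    z, zt = omega * (W0 * x + b0), omega * (W0 * xt + b0)
    # gradient wrt W1 is sin(z)/sqrt(n); gradient wrt b1 is 1
    k = np.dot(np.sin(z), np.sin(zt)) / n + 1.0
    # gradients wrt W0 and b0 both carry omega * W1_j * cos(z_j) / sqrt(n)
    k += omega**2 * (x * xt + 1.0) * np.dot(W1**2 * np.cos(z), np.cos(zt)) / n
    return k

rng = np.random.default_rng(0)
n = 8192
W0, b0, W1 = (rng.standard_normal(n) for _ in range(3))
# kernel value decays as x~ moves away from x = 0: low-pass behavior
vals = [shallow_ssn_ntk(0.0, xt, W0, b0, W1, omega=4.0) for xt in (0.0, 0.5, 2.0)]
```

At ω = 4 the kernel mass is concentrated within roughly a ω⁻¹-wide neighborhood of the diagonal, matching the Gaussian approximation with variance ω⁻² discussed above.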

5. EMPIRICAL ANALYSIS

As shown above, neural tangent kernel theory suggests that sinusoidal networks work as low-pass filters, with their bandwidth controlled by the parameter ω. In this section, we demonstrate empirically that we can observe this predicted behavior even in real sinusoidal networks. For this experiment, we generate a 512 × 512 monochromatic image by super-imposing two orthogonal sinusoidal signals, each consisting of a single frequency, f (x, y) = cos(128πx) + cos(32πy). This function is sampled in the domain [-1, 1] 2 to generate the image on the left of Figure 2 . To demonstrate what we can expect from applying low-pass filters of different bandwidths to this signal, we perform a discrete Fourier transform (DFT), cut off frequencies above a certain value, and perform an inverse transform to recover the (filtered) image. The MSE of the reconstruction, as a function of the cutoff frequency, is shown in Figure 3 . We can see that due to the simple nature of the signal, containing only two frequencies, there are only three loss levels. If indeed the NTK analysis is correct and sinusoidal networks act as low-pass filters, with bandwidth controlled by ω, we should be able to observe similar behavior with sinusoidal networks with different ω values. We plot the final training loss and training curves for sinusoidal networks with different ω in Figure 3 . We can observe, again, that there are three consistent loss levels following the magnitude of the ω parameter, in line with the intuition that the sinusoidal network is working as a low-pass filter. This is also observable in Figure 2 , where we see example reconstructions for networks of various ω values after training. However, unlike with the DFT low-pass filter (which does not involve any learning), we see in Figure 3 that during training some sinusoidal networks shift from one loss level to a lower one. 
This demonstrates that sinusoidal networks differ from true low-pass filters in that their weights can change, which implies that the bandwidth defined by ω also changes with learning. We know the weights W₁ in the first layer of a sinusoidal network, given by f₁(x) = sin(ω(W₁x + b₁)), will change with training. Empirically, we observed that the spectral norm of W₁ increases throughout training for small ω values. We can interpret this as the overall magnitude of the term ωW₁x increasing, which is functionally equivalent to an increase in ω itself. In Figure 3, we observe that sinusoidal networks with smaller values of ω take a longer time to achieve a lower loss (if at all). Intuitively, this happens because, due to the effect described above, lower ω values require a larger increase in magnitude by the weights W₁. Given that all networks were trained with the same learning rate, the ones with a smaller ω require their weights to move a longer distance, and thus take more training steps to achieve a lower loss.
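The ideal low-pass baseline from this section can be reproduced in a few lines: build the two-frequency image, zero out DFT coefficients above a cutoff, and measure the reconstruction MSE. The per-axis (rather than radial) cutoff is our illustrative choice; the three cutoffs below land on the three loss plateaus described in the text.

```python
import numpy as np

# Signal from Section 5: f(x, y) = cos(128*pi*x) + cos(32*pi*y) on [-1, 1]^2,
# sampled on a 512 x 512 grid (DFT frequency bins 128 and 32, respectively).
n = 512
x = np.linspace(-1, 1, n, endpoint=False)
img = np.cos(128 * np.pi * x)[None, :] + np.cos(32 * np.pi * x)[:, None]

def lowpass_mse(img, cutoff):
    """Zero all DFT coefficients above `cutoff` along either axis and
    return the reconstruction MSE of the resulting ideal low-pass filter."""
    k = np.fft.fftfreq(img.shape[0], d=1.0 / img.shape[0])  # integer frequencies
    keep = (np.abs(k)[:, None] <= cutoff) & (np.abs(k)[None, :] <= cutoff)
    rec = np.fft.ifft2(np.fft.fft2(img) * keep).real
    return np.mean((rec - img) ** 2)

# Three plateaus: both frequencies kept, only the 32 kept, neither kept.
mses = [lowpass_mse(img, c) for c in (200, 64, 16)]
```

Since each removed unit-amplitude cosine contributes a mean squared error of 1/2, the three plateaus sit at approximately 0, 0.5 and 1.0.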

6. TUNING ω

As shown in the previous section, though the bandwidth of a network can change throughout training, the choice of ω still influences how easily and quickly (if at all) it can learn a given signal. The value of the ω parameter is thus crucial for the learning of the network. Despite this fact, in SIRENs, for example, this value is not adjusted for each task (except for the audio fitting experiments), and is simply set empirically to an arbitrary value. In this section, we seek to justify a proper initialization for this parameter, such that it can be chosen appropriately for each given task. Moreover, it is often not the case that we simply want to fit only the exact training samples; instead, we want to find a good interpolation (i.e., generalize well). Setting ω too high, and thus allowing the network to model frequencies that are much larger than the ones present in the actual signal, is likely to cause overfitting. This is demonstrated empirically in Figure 4. Consequently, we want instead to tune the network to the highest frequency present in the signal. However, we do not always have knowledge of the value of the highest frequency in the true underlying signal of interest. Moreover, we have also observed that, since the network learns and its weights change in magnitude, that value in fact changes with training. Therefore, the most we can hope for is a good heuristic to guide the choice of ω. Nevertheless, having a reasonable guess for ω is also likely sufficient for good performance, precisely due to the ability of the network to adapt during training and compensate for a possibly slightly suboptimal choice. Choosing ω from the Nyquist frequency. One source of empirical information on the relationship between ω and the sinusoidal network's "learnable frequencies" is the previous section's empirical analysis. Taking into account the scaling, we can see from Fig. 3 that around ω = 16 the network starts to be able to learn the full signal (freq. 128). We can similarly note that at about ω = 4 the sinusoidal network starts to be able to efficiently learn a signal with frequency 32, but not the one with frequency 128. This scaling suggests a heuristic of setting ω to about 1/8 of the signal's maximum frequency. For natural signals, such as pictures, it is common for frequencies up to the Nyquist frequency of the discrete sampling to be present. We provide an example for the "camera" image we have utilized so far in Figure 23 in Appendix E, where we can see that the reconstruction loss through a low-pass filter continues to decrease significantly up to the Nyquist frequency for the image resolution. In light of this information, analyzing the choices of ω for the experiments in Section 3.1 again suggests that ω should be set around 1/8 of the Nyquist frequency of the signal. These values of ω are summarized in Table 2 in the "Fitting ω" column. For example, the image fitting experiment shows that, for an image of shape 512 × 512 (and thus a Nyquist frequency of 256 for each dimension), this heuristic suggests an ω value of 256/8 = 32, which is the value found to work best empirically through search. We find similar results for the audio fitting experiments. The audio signals used in the audio fitting experiment contained approximately 300,000 and 500,000 points, and thus maximum frequencies of approximately 150,000 and 250,000. This suggests reasonable values for ω of 18,750 and 31,250, which are close to the ones found empirically to work well. In examples such as the video fitting experiments, in which each dimension has a different frequency, it is not completely clear how to pick a single ω to fit all dimensions. This suggests that having independent values of ω for each dimension might be useful for such cases, as discussed in the next section.
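The Nyquist heuristic above amounts to one line of arithmetic; a small helper (the function name is ours) reproduces the worked examples from the text:

```python
def omega_from_nyquist(n_samples_per_dim):
    """Section 6 heuristic: the Nyquist frequency of n samples is n/2,
    and omega is set to roughly 1/8 of it."""
    nyquist = n_samples_per_dim / 2
    return nyquist / 8

# 512 x 512 image -> omega = 32; a ~300,000-sample audio clip -> omega = 18,750
omega_img = omega_from_nyquist(512)
omega_audio = omega_from_nyquist(300_000)
```

These match the values reported as working best empirically for the image and audio fitting experiments.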
Finally, when performing the generalization experiments in Section 7, we show that the best performing ω ended up being half the value of the best ω used in the fitting tasks from Section 3.1. This follows intuitively, since for the generalization task we set apart half the points for training and the other half for testing, thus dividing the maximum possible frequency in the training sample in half, providing further evidence of the relationship between ω and the maximum frequency in the input signal. Multi-dimensional ω. In many problems, such as the video fitting and PDE problems, not only is the input space multi-dimensional, it also contains time and space dimensions (which are additionally possibly of different shape). This suggests that employing a multi-dimensional ω, specifying different frequencies for each dimension, might be beneficial. In practice, if we employ a scaling factor λ = [λ₁ λ₂ ... λ_d]ᵀ, we have the first layer of the sinusoidal network given by

f₁(x) = sin(ω(W₁(λ ⊙ x) + b₁)) = sin(W₁(Ω ⊙ x) + ωb₁),

where Ω = [λ₁ω λ₂ω ... λ_dω]ᵀ works as a multi-dimensional ω. In the following experiments, we employ this approach for three-dimensional problems, in which we have time and differently shaped space domains, namely the video fitting and physics-informed neural network PDE experiments. For these experiments, we report ω in the form of the (already scaled) Ω vector for simplicity. Choosing ω from available information. Finally, in many problems we do have some knowledge of the underlying signal we can leverage, such as in the case of inverse problems. For example, say we have velocity fields for a fluid and we are trying to solve for the coupled pressure field and the Reynolds number using a physics-informed neural network (as done in Section 7). In this case, we have access to two components of the solution field.
Performing a Fourier transform on the available training data can reveal the relevant spectrum and inform our choice of ω. If the maximum frequency in the signal is lower than the Nyquist frequency implied by the sampling, this can lead to a more appropriate choice of ω than the one suggested purely by the sampling.
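One concrete way to implement this spectrum-based choice is sketched below: find the lowest frequency bin capturing most of the signal energy, then apply the 1/8 heuristic. The 99% energy threshold and the function name are our illustrative choices. Applied independently along each input dimension, the same procedure yields the per-dimension entries of the Ω vector discussed above.

```python
import numpy as np

def choose_omega(samples, energy_frac=0.99, scale=1 / 8):
    """Estimate the maximum significant frequency of a 1-D signal from its
    power spectrum, then apply the omega = f_max / 8 heuristic.
    The 99% energy threshold is an arbitrary illustrative choice."""
    power = np.abs(np.fft.rfft(samples - samples.mean())) ** 2
    cum = np.cumsum(power)
    f_max = np.searchsorted(cum, energy_frac * cum[-1])  # cycles per window
    return scale * f_max

# Toy signal with frequency content at 10 and 40 cycles over the window.
t = np.arange(1024) / 1024
sig = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
omega = choose_omega(sig)  # max significant frequency is 40, so omega = 40/8
```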

7. EXPERIMENTS

In this section, we first perform experiments to demonstrate how the optimal value of ω influences the generalization error of a sinusoidal network, following the discussion in Section 6. After that, we demonstrate that sinusoidal networks with properly tuned ω values outperform traditional physics-informed neural networks in classic PDE tasks.

7.1. EVALUATING GENERALIZATION

We now evaluate the simple sinusoidal network's generalization capabilities. To do this, in all experiments in this section we segment the input signal into training and test sets using a checkerboard pattern: along all axis-aligned directions, points alternate between belonging to the train and test sets. We perform audio, image and video fitting experiments. When performing these experiments, we search for the best performing ω value for generalization (defined as performance on the held-out points). We report the best values in Table 2. We observe that, as expected from the discussion in Section 6, the best performing ω values follow the heuristic discussed above, and are in fact half the best-performing values found in the previous fitting experiments from Section 3.1, confirming our expectation. This is also demonstrated in the plot in Figure 4, in which we can see that choosing an appropriate ω value from the heuristics described previously leads to a good fit and interpolation, while setting ω too high leads to overfitting, poor generalization outside the training points, and interpolation artifacts due to overfitting of spurious high-frequency components. For the video signals, which have different sizes along each axis, we employ a multi-dimensional ω. We scale each dimension of ω proportional to the size of the input signal along the corresponding axis.

Published as a conference paper at ICLR 2023
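The checkerboard split described above can be written compactly: a point belongs to the training set when the parity of the sum of its grid indices is even, so neighbors along every axis-aligned direction fall in opposite sets. The function name is ours.

```python
import numpy as np

def checkerboard_split(shape):
    """Alternate points between train and test along all axis-aligned
    directions, as in the generalization experiments."""
    parity = np.indices(shape).sum(axis=0) % 2
    train_mask = parity == 0
    return train_mask, ~train_mask

train, test = checkerboard_split((4, 4))  # half the points in each set
```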

7.2. SOLVING DIFFERENTIAL EQUATIONS

Finally, we apply our analysis to physics-informed learning. We compare the performance of simple sinusoidal networks to the tanh networks that are commonly used for these tasks. Results are summarized in Table 3 . Details for the Schrödinger and Helmholtz experiments are presented in Appendix E.

7.2.1. BURGERS EQUATION (IDENTIFICATION)

This experiment reproduces the Burgers equation identification experiment from Raissi et al. (2019a). Here we are identifying the parameters λ₁ and λ₂ of a 1D Burgers equation, uₜ + λ₁u uₓ − λ₂uₓₓ = 0, given a known solution field. The ground truth values of the parameters are λ₁ = 1.0 and λ₂ = 0.01/π. In order to find a good value for ω, we perform a low-pass reconstruction of the solution as before. We can observe in Figure 5 that the solution does not have high bandwidth, with most of the loss being minimized with only the lower half of the spectrum. Note that the sampling performed for the training data (N = 2,000) is sufficient to support such frequencies. This suggests an ω value in the range 8–10. Indeed, we observe that ω = 10 gives the best identification of the desired parameters, with errors of 0.0071% and 0.0507% for λ₁ and λ₂ respectively, against errors of 0.0521% and 0.4522% for the baseline. This value of ω also achieves the lowest reconstruction loss against the known solution, with an MSE of 8.034 × 10⁻⁴. Figure 5 shows the reconstructed solution using the identified parameters.
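For readers unfamiliar with the physics-informed loss used here, the sketch below evaluates the Burgers residual uₜ + λ₁u uₓ − λ₂uₓₓ on a grid. We use finite differences for illustration, whereas a PINN would differentiate the network output with automatic differentiation; the test function u(t, x) = x is our own choice, picked because its residual is analytically λ₁·x.

```python
import numpy as np

def burgers_residual(u, t, x, lam1, lam2):
    """Residual of u_t + lam1*u*u_x - lam2*u_xx for u sampled on a (t, x) grid.
    Finite differences stand in for the autodiff a PINN would use."""
    u_t = np.gradient(u, t, axis=0)
    u_x = np.gradient(u, x, axis=1)
    u_xx = np.gradient(u_x, x, axis=1)
    return u_t + lam1 * u * u_x - lam2 * u_xx

t = np.linspace(0, 1, 64)
x = np.linspace(-1, 1, 128)
U = np.tile(x, (64, 1))  # test function u(t, x) = x: u_t = 0, u_x = 1, u_xx = 0
res = burgers_residual(U, t, x, lam1=1.0, lam2=0.01 / np.pi)
# for this test function the residual reduces to lam1 * x
```

In the identification setting, λ₁ and λ₂ would be trainable parameters and the mean squared residual would enter the loss alongside the supervised data term.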

7.2.2. NAVIER-STOKES (IDENTIFICATION)

This experiment reproduces the Navier-Stokes identification experiment from Raissi et al. (2019a). In this experiment, we are trying to identify the parameters λ₁, λ₂ and the pressure field p of the 2D Navier-Stokes equations given by ∂u/∂t + λ₁(u · ∇)u = −∇p + λ₂∇²u, given known velocity fields u and v. The ground truth values of the parameters are λ₁ = 1.0 and λ₂ = 0.01. Unlike the 1D Burgers case, in this case the number of points sampled for the training set (N = 5,000) is not high compared to the size of the full solution volume, and is thus the limiting factor for the bandwidth of the input signal, given the random sampling of points from the full solution. (Table 3 caption: values are percent error relative to the ground truth value of each parameter for identification problems, and mean squared error (MSE) for inference problems (Raissi et al., 2019a; Sitzmann et al., 2020); the Helmholtz experiment is the same as in Section 3.1.) The best-performing Ω gives errors of 0.0038% and 1.782% for λ₁ and λ₂ respectively, against errors of 0.0046% and 2.093% for the baseline. Figure 6 shows the identified pressure field. Note that, given the nature of the problem, this field can only be identified up to a constant.

8. CONCLUSION

In this work, we have presented a simplified formulation for sinusoidal networks. Analysis of this architecture from the neural tangent kernel perspective, combined with empirical results, reveals that the kernel for sinusoidal networks corresponds to a low-pass filter with adjustable bandwidth. We leverage this information in order to initialize these networks appropriately, choosing their bandwidth such that it is tuned to the signal being learned. Employing this strategy, we demonstrated improved results in both implicit modelling and physics-informed learning tasks.

A SIMPLE SINUSOIDAL NETWORK INITIALIZATION

We present here the proofs for the initialization scheme of the simple sinusoidal network from Section 3.

Proof. By definition, Var[X] = σ² = c²/3. For Y, we know that the variance of a uniformly distributed random variable with bounds [a, b] is given by (b − a)²/12. Thus, Var[Y] = (2c)²/12 = c²/3.

Theorem 5. For a uniform input in [−1, 1], the activations throughout a sinusoidal network are approximately standard normal distributed before each sine non-linearity and arcsine-distributed after each sine non-linearity, irrespective of the depth of the network, if the weights are distributed normally, with mean 0 and variance 2/n, where n is the layer's fan-in.

Proof. The proof follows exactly the proof of Theorem 1.8 in Sitzmann et al. (2020), only using Lemma 4 when necessary to show that the initialization proposed here has the same variance necessary for the proof to follow.
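The variance equality in the lemma is easy to confirm numerically; a quick seeded Monte Carlo check (sample size and tolerance are our choices):

```python
import numpy as np

# Check that N(0, c^2/3) and U(-c, c) share the variance c^2/3.
rng = np.random.default_rng(0)
c = 2.0
X = rng.normal(0.0, np.sqrt(c**2 / 3), size=1_000_000)
Y = rng.uniform(-c, c, size=1_000_000)
# both sample variances should be close to c^2/3 = 4/3
```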

B EMPIRICAL EVALUATION OF SSN INITIALIZATION

Here we report an empirical analysis of the initialization scheme of simple sinusoidal networks, referenced in Section 3. For this analysis we use a sinusoidal MLP with 6 hidden layers of 2048 units, and single-dimensional input and output. This MLP is initialized using the simplified scheme described above. For testing, 2⁸ equally spaced inputs from the range [−1, 1] are passed through the network. We then plot the histogram of activations after each linear operation (before the sine non-linearity) and after each sine non-linearity. To match the original plot, we also plot the 1D Fast Fourier Transform of all activations in a layer, and the gradient of the output with respect to each activation. These results are presented in Figure 8. The main conclusion from this figure is that the distribution of activations matches the predicted normal (before the non-linearity) and arcsine (after the non-linearity) distributions, and that this behavior is stable across many layers. We also reproduced the same result for up to 50 layers. We then perform an additional experiment in which the exact same setup as above is employed, but the 1D inputs are shifted by a large value (i.e., x → x + 1000). We show the same plots as before in Figure 9. We can see that there is essentially no change from the previous plot, which demonstrates the sinusoidal network's shift-invariance in the input space, one of its important desirable properties, as discussed previously.
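The depth-stability claim can be probed without plots by forwarding random data through wide sine layers with N(0, 2/fan_in) weights and tracking the pre-activation spread. This is a rough sanity check of ours, with biases omitted as a simplification; it only verifies that the spread stays near 1 rather than exploding or vanishing, not the exact limiting distributions.

```python
import numpy as np

# Track pre-activation standard deviation through stacked sine layers
# initialized with W ~ N(0, 2/fan_in), as proposed in Section 3.
rng = np.random.default_rng(0)
n, batch = 1024, 2048
z = rng.standard_normal((batch, n))  # stand-in for a well-scaled first layer
stds = []
for _ in range(6):
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
    z = np.sin(z) @ W  # sine non-linearity, then the next linear layer
    stds.append(z.std())
# the spread stays close to 1 across depth
```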

C EXPERIMENTAL DETAILS FOR COMPARISON TO SIREN

Below, we present qualitative results and describe experimental details for each experiment. As these are a reproduction of the experiments in Sitzmann et al. (2020), we refer to their details as well for further information.

C.1 IMAGE

In the image fitting experiment, we treat an image as a function from the spatial domain to color values, (x, y) → (r, g, b). In the case of a monochromatic image, used here, this function maps instead to one-dimensional intensity values. We try to learn a function f: ℝ² → ℝ, parametrized as a sinusoidal network, in order to fit such an image. Figure 7 shows the image used in this experiment, and the reconstruction from the fitted sinusoidal network. The gradient and Laplacian of the learned function are also presented, demonstrating that higher order derivatives are also learned appropriately.

C.2 POISSON

These tasks are similar to the image fitting experiment, but instead of supervising directly on the ground truth image, the fitted sinusoidal network is supervised on its derivatives, constituting a Poisson problem. We perform the experiment by supervising both on the input image's gradient and Laplacian, and report the reconstruction of the image and its gradients in each case. Figure 10 shows the image used in this experiment, and the reconstructions from the fitted sinusoidal networks. Since reconstruction from derivatives can only be correct up to a scaling factor, we scale the reconstructions for visualization. As in the original SIREN results, we can observe that the reconstruction from the gradient is of higher quality than the one from the Laplacian. Training parameters. The input image used is of size 256 × 256, mapped from an input domain [−1, 1]². The sinusoidal network used is a 5-layer MLP with hidden size 256, following the proposed initialization scheme above. For both experiments, the parameter ω is set to 32 and the Adam optimizer is used. For the gradient experiments, in the short and long training results, a learning rate of 1 × 10⁻⁴ is used, trained for 10,000 and 20,000 steps respectively. For the Laplace experiments, in the short and long training results, a learning rate of 1 × 10⁻³ is used, trained for 10,000 and 20,000 steps respectively.
The video tasks are similar to the image fitting experiment, but we instead fit a video, which also has a temporal input dimension, $(t, x, y) \to (r, g, b)$. We learn a function $f : \mathbb{R}^3 \to \mathbb{R}^3$, parametrized as a sinusoidal network, in order to fit such a video.

C.4 AUDIO

Figure 13: Ground truth and reconstructed waveforms for "Bach" and "counting" audios.

In the audio experiments, we fit an audio signal in the temporal domain as a waveform $t \to w$. We learn a function $f : \mathbb{R} \to \mathbb{R}$, parametrized as a sinusoidal network, in order to fit the audio. Figure 13 shows the waveforms for the input audios and the reconstructed audios from the fitted sinusoidal networks. In this experiment, we utilized a lower learning rate for the first layer compared to the rest of the network. This compensates for the very large $\omega$ used (in the 15,000-30,000 range, compared to the 10-30 range for all other experiments). One might argue that this re-introduces complexity, counteracting the purpose of the proposed simplification. However, we would claim (1) that this is only required in cases with an extremely high $\omega$, which occurred only when fitting audio waves, and (2) that adjusting the learning rate for an individual layer is still simpler and more in line with standard machine learning practice than multiplying all layers by a scaling factor and then adjusting their initialization variance by the same amount.

In this experiment we solve for the unknown wavefield $\Phi : \mathbb{R}^2 \to \mathbb{R}^2$ in the Helmholtz equation $(\Delta + k^2)\Phi(x) = -f(x)$, with known wavenumber $k$ and source function $f$ (a Gaussian with $\mu = 0$ and $\sigma^2 = 10^{-4}$). We solve this differential equation using a sinusoidal network supervised with the physics-informed loss $\int_\Omega \|(\Delta + k^2)\Phi(x) + f(x)\|_1 \, dx$, evaluated at random points sampled uniformly in the domain $\Omega = [-1, 1]^2$.
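The per-layer learning rate mentioned above is standard machine learning machinery. As a hedged, dependency-free sketch (the layer names and the specific rates here are illustrative, not the paper's values), one can simply map each parameter to its own rate:

```python
import numpy as np

# Hypothetical sketch: per-layer learning rates, with a much smaller rate for
# the first layer to offset its very large omega factor.
params = {
    "layer0.W": np.zeros((64, 1)), "layer0.b": np.zeros(64),
    "layer1.W": np.zeros((1, 64)), "layer1.b": np.zeros(1),
}
lrs = {name: (1e-6 if name.startswith("layer0") else 1e-4) for name in params}

def sgd_step(params, grads, lrs):
    """One gradient step, each parameter using its own learning rate."""
    return {name: p - lrs[name] * grads[name] for name, p in params.items()}

# Dummy gradients of all ones, just to exercise the update rule.
grads = {name: np.ones_like(p) for name, p in params.items()}
params = sgd_step(params, grads, lrs)
```

Deep learning frameworks expose the same idea through optimizer parameter groups, so no custom training loop is actually needed.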
Figure 14 shows the real and imaginary components of the ground truth solution to the differential equation and the solution recovered by the fitted sinusoidal network.

Training parameters. The sinusoidal network used is a 5-layer MLP with hidden size 256, following the proposed initialization scheme above. The parameter $\omega$ is set to 16. The Adam optimizer is used, with a learning rate of $3 \cdot 10^{-4}$, trained for 50,000 steps.

In order to perform the subsequent NTK analysis, we first need to formalize definitions for simple sinusoidal networks and SIRENs. The definitions used here adhere to common NTK analysis practices, and thus differ slightly from practical implementations.

Definition 1. For the purposes of the following proofs, a (sinusoidal) fully-connected neural network with $L$ hidden layers that takes as input $x \in \mathbb{R}^{n_0}$ is defined as the function $f^{(L)} : \mathbb{R}^{n_0} \to \mathbb{R}^{n_{L+1}}$, recursively given by
$$f^{(0)}(x) = \omega \left( W^{(0)} x + b^{(0)} \right), \qquad f^{(L)}(x) = W^{(L)} \frac{1}{\sqrt{n_L}} \sin\left( f^{(L-1)} \right) + b^{(L)},$$
where $\omega \in \mathbb{R}$. The parameters $\{ W^{(j)} \}_{j=0}^{L}$ have shape $n_{j+1} \times n_j$, with each element sampled independently either from $\mathcal{N}(0, 1)$ (for simple sinusoidal networks) or from $\mathcal{U}(-c, c)$ with some bound $c \in \mathbb{R}$ (for SIRENs). The $\{ b^{(j)} \}_{j=0}^{L}$ are $n_{j+1}$-dimensional vectors sampled independently from $\mathcal{N}(0, I_{n_{j+1}})$.

With this definition, we now state the general formulation of the NTK, which applies to fully-connected networks with Lipschitz non-linearities in general, and consequently to the sinusoidal networks studied here in particular. Let us first define the NNGP, which has covariance recursively defined by
$$\Sigma^{(L+1)}(x, \bar{x}) = \mathbb{E}_{f \sim \mathcal{N}(0, \Sigma^{(L)})}\left[ \sigma(f(x)) \, \sigma(f(\bar{x})) \right] + \beta^2,$$
with base case $\Sigma^{(1)}(x, \bar{x}) = \frac{1}{n_0} x^T \bar{x} + \beta^2$, where $\beta$ gives the variance of the bias terms in the neural network layers (Neal, 1994; Lee et al., 2018). Now the NTK is given by the following theorem.

Theorem 6.
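Definition 1 can be transcribed directly into code. The following is a hedged sketch (the function names and defaults are ours, for illustration only) of sampling parameters and evaluating the recursion for both initialization variants:

```python
import numpy as np

def init_params(widths, siren=False, c=np.sqrt(6), seed=0):
    """Sample W^(j) ~ N(0,1) (SSN) or U(-c,c) (SIREN), b^(j) ~ N(0, I), per Definition 1."""
    rng = np.random.default_rng(seed)
    Ws, bs = [], []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        if siren:
            Ws.append(rng.uniform(-c, c, (n_out, n_in)))
        else:
            Ws.append(rng.standard_normal((n_out, n_in)))
        bs.append(rng.standard_normal(n_out))
    return Ws, bs

def forward(x, Ws, bs, w0):
    """f^(0) = w0 (W x + b); f^(L) = W sin(f^(L-1)) / sqrt(n_L) + b."""
    h = w0 * (Ws[0] @ x + bs[0])
    for W, b in zip(Ws[1:], bs[1:]):
        h = W @ np.sin(h) / np.sqrt(len(h)) + b
    return h
```

Note the $1/\sqrt{n_L}$ factor in the hidden layers (the NTK parametrization), which differs from the practical implementations discussed elsewhere in the paper.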
For a neural network with $L$ hidden layers $f^{(L)} : \mathbb{R}^{n_0} \to \mathbb{R}^{n_{L+1}}$ following Definition 1, as the size of the hidden layers $n_1, \dots, n_L \to \infty$ sequentially, the neural tangent kernel (NTK) of $f^{(L)}$ converges in probability to the deterministic kernel $\Theta^{(L)}$, defined recursively as
$$\Theta^{(0)}(x, \bar{x}) = \Sigma^{(0)}(x, \bar{x}) = \omega^2 \left( x^T \bar{x} + 1 \right),$$
$$\Theta^{(L)}(x, \bar{x}) = \Theta^{(L-1)}(x, \bar{x}) \, \dot{\Sigma}^{(L)}(x, \bar{x}) + \Sigma^{(L)}(x, \bar{x}),$$
where $\{ \Sigma^{(l)} \}_{l=0}^{L}$ are the neural network Gaussian processes (NNGPs) corresponding to each $f^{(l)}$, and
$$\dot{\Sigma}^{(l)}(x, \bar{x}) = \mathbb{E}_{(u, v) \sim \mathcal{N}(0, \Sigma^{(l-1)}(x, \bar{x}))}\left[ \cos(u) \cos(v) \right].$$

Proof. This is a standard general NTK theorem, showing that the limiting kernel is given recursively in terms of the network's NNGPs and the previous layer's NTK. For brevity we omit the proof here and refer the reader to, for example, Jacot et al. (2020). The only difference is in the base case $\Sigma^{(0)}$, due to the additional $\omega$ parameter in the first layer. It is simple to see that the network with 0 hidden layers, i.e. the linear model $\omega ( W^{(0)} x + b^{(0)} )$, leads to the same Gaussian process covariance kernel as in the original proof, $x^T \bar{x} + 1$, adjusted by the additional variance factor $\omega^2$.

Theorem 6 demonstrates that the NTK can be constructed as a recursive function of the NTKs of previous layers and the network's NNGPs. In the following sections we will derive the NNGPs for the SIREN and the simple sinusoidal network directly, and then use them with Theorem 6 to derive their NTKs as well. To finalize this preliminary section, we also provide two propositions that will be useful in the proofs that follow.

Completing the square, we get
$$\prod_{j=1}^{d} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{i \omega w_j x_j} e^{-\frac{1}{2} w_j^2} \, dw_j = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{\frac{1}{2}\left( i^2 \omega^2 x_j^2 - i^2 \omega^2 x_j^2 + 2 i \omega x_j w_j - w_j^2 \right)} \, dw_j$$
$$= \prod_{j=1}^{d} e^{\frac{1}{2} i^2 \omega^2 x_j^2} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left( i^2 \omega^2 x_j^2 - 2 i \omega x_j w_j + w_j^2 \right)} \, dw_j = \prod_{j=1}^{d} e^{-\frac{\omega^2}{2} x_j^2} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left( w_j - i \omega x_j \right)^2} \, dw_j.$$
Since the integral and its preceding factor constitute a Gaussian pdf, they integrate to 1, leaving the final result
$$\prod_{j=1}^{d} e^{-\frac{\omega^2}{2} x_j^2} = e^{-\frac{\omega^2}{2} \sum_{j=1}^{d} x_j^2} = e^{-\frac{\omega^2}{2} \|x\|_2^2}.$$

Proposition 8. For each factor in the product, we have
$$\int_{-c}^{c} \cos(\omega w_j x_j) \, dw_j + i \int_{-c}^{c} \sin(\omega w_j x_j) \, dw_j = \left[ \frac{\sin(\omega w_j x_j)}{\omega x_j} \right]_{-c}^{c} - i \left[ \frac{\cos(\omega w_j x_j)}{\omega x_j} \right]_{-c}^{c} = \frac{2 \sin(c \, \omega x_j)}{\omega x_j}.$$
Finally, plugging this back into the product above, we get

D.2 SHALLOW SINUSOIDAL NETWORKS

For the next few proofs, we will focus on neural networks with a single hidden layer, i.e. $L = 1$. Expanding the definition above, such a network is given by
$$f^{(1)}(x) = W^{(1)} \frac{1}{\sqrt{n_1}} \sin\left( \omega \left( W^{(0)} x + b^{(0)} \right) \right) + b^{(1)}. \qquad (4)$$
The advantage of analyzing such shallow networks is that their NNGPs and NTKs have formulations that are intuitively interpretable, providing insight into their characteristics. We later extend these derivations to networks of arbitrary depth.

D.2.1 SIREN

First, let us derive the NNGP for a SIREN with a single hidden layer.

Theorem 9. Shallow SIREN NNGP. For a single hidden layer SIREN $f^{(1)} : \mathbb{R}^{n_0} \to \mathbb{R}^{n_2}$ following Definition 1, as the size of the hidden layer $n_1 \to \infty$, $f^{(1)}$ tends (by law of large numbers) to the neural network Gaussian process (NNGP) with covariance
$$\Sigma^{(1)}(x, \bar{x}) = \frac{c^2}{6} \left( \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j - \bar{x}_j)) - e^{-2\omega^2} \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j + \bar{x}_j)) \right) + 1.$$

Proof. We first show that despite the use of a uniform distribution for the weights, this initialization scheme still leads to an NNGP. In this initial part, we follow an approach similar to Lee et al. (2018), with the modifications necessary for this conclusion to hold. From our neural network definition, each element $f^{(1)}(x)_j$ in the output vector is a weighted combination of elements in $W^{(1)}$ and $b^{(1)}$. Conditioning on the outputs from the first layer ($L = 0$), since the sine function is bounded and each of the parameters is uniformly distributed with finite variance and zero mean, the $f^{(1)}(x)_j$ become normally distributed with mean zero as $n_1 \to \infty$ by the (Lyapunov) central limit theorem (CLT). Since any subset of elements in $f^{(1)}(x)$ is jointly Gaussian, this outer layer is described by a Gaussian process.

Now that we have concluded that this initialization scheme still entails an NNGP, its covariance is determined by $\sigma_W^2 \Sigma^{(1)} + \sigma_b^2 = \frac{c^2}{3} \Sigma^{(1)} + 1$, where
$$\Sigma^{(1)}(x, \bar{x}) = \lim_{n_1 \to \infty} \frac{1}{n_1} \left\langle \sin\left( f^{(0)}(x) \right), \sin\left( f^{(0)}(\bar{x}) \right) \right\rangle = \lim_{n_1 \to \infty} \frac{1}{n_1} \sum_{j=1}^{n_1} \sin\left( \omega \left( W_j^{(0)} x + b_j^{(0)} \right) \right) \sin\left( \omega \left( W_j^{(0)} \bar{x} + b_j^{(0)} \right) \right).$$
Now by the law of large numbers (LLN) the limit above converges to
$$\mathbb{E}_{w \sim \mathcal{U}^{n_0}(-c, c), \, b \sim \mathcal{N}(0, 1)}\left[ \sin\left( \omega \left( w^T x + b \right) \right) \sin\left( \omega \left( w^T \bar{x} + b \right) \right) \right],$$
where $w \in \mathbb{R}^{n_0}$ and $b \in \mathbb{R}$.
Omitting the distributions from the expectation for brevity and expanding the exponential definition of sine, we have
$$\mathbb{E}\left[ \frac{1}{2i} \left( e^{i\omega(w^T x + b)} - e^{-i\omega(w^T x + b)} \right) \frac{1}{2i} \left( e^{i\omega(w^T \bar{x} + b)} - e^{-i\omega(w^T \bar{x} + b)} \right) \right]$$
$$= -\frac{1}{4} \mathbb{E}\left[ e^{i\omega(w^T x + b) + i\omega(w^T \bar{x} + b)} - e^{i\omega(w^T x + b) - i\omega(w^T \bar{x} + b)} - e^{-i\omega(w^T x + b) + i\omega(w^T \bar{x} + b)} + e^{-i\omega(w^T x + b) - i\omega(w^T \bar{x} + b)} \right]$$
$$= -\frac{1}{4} \left( \mathbb{E}\left[ e^{i\omega w^T (x + \bar{x})} \right] \mathbb{E}\left[ e^{2i\omega b} \right] - \mathbb{E}\left[ e^{i\omega w^T (x - \bar{x})} \right] - \mathbb{E}\left[ e^{-i\omega w^T (x - \bar{x})} \right] + \mathbb{E}\left[ e^{-i\omega w^T (x + \bar{x})} \right] \mathbb{E}\left[ e^{-2i\omega b} \right] \right).$$
Applying Propositions 7 and 8 to each expectation above, and noting that the sinc function is even, we are left with
$$-\frac{1}{4} \left( 2 e^{-2\omega^2} \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j + \bar{x}_j)) - 2 \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j - \bar{x}_j)) \right) = \frac{1}{2} \left( \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j - \bar{x}_j)) - e^{-2\omega^2} \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j + \bar{x}_j)) \right).$$

For simplicity, if we take the case of a one-dimensional input (e.g., an audio signal) with the standard SIREN setting of $c = \sqrt{6}$, the NNGP reduces to
$$\Sigma^{(1)}(x, \bar{x}) = \mathrm{sinc}\left( \sqrt{6} \, \omega (x - \bar{x}) \right) - e^{-2\omega^2} \mathrm{sinc}\left( \sqrt{6} \, \omega (x + \bar{x}) \right) + 1.$$
We can already notice that this kernel is composed of sinc functions; the sinc function is the ideal low-pass filter. For any value of $\omega > 1$, the first term in the expression above completely dominates, due to the exponential $e^{-2\omega^2}$ factor. In practice, $\omega$ is commonly set to values at least one order of magnitude above 1, if not multiple orders of magnitude above that in certain cases (e.g., high frequency audio signals). This leaves us with simply
$$\Sigma^{(1)}(x, \bar{x}) = \mathrm{sinc}\left( \sqrt{6} \, \omega (x - \bar{x}) \right) + 1.$$
Notice that not only does the kernel reduce to the sinc function, it also reduces to a function solely of $\Delta x = x - \bar{x}$. This agrees with the shift-invariance property we observe in SIRENs, since the NNGP depends only on $\Delta x$, and not on the particular values of $x$ and $\bar{x}$. Notice also that $\omega$ defines the bandwidth of the sinc function, thus determining the maximum frequencies it lets pass.
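This closed form is easy to evaluate numerically. The sketch below (our own helper function, using the fact that NumPy's `np.sinc` is the normalized $\sin(\pi u)/(\pi u)$) computes the shallow SIREN NNGP of Theorem 9 and illustrates the approximate shift-invariance at large $\omega$:

```python
import numpy as np

def siren_nngp(x, xb, w0, c=np.sqrt(6)):
    """Shallow SIREN NNGP of Theorem 9 for input vectors x, xb (1-D arrays).

    np.sinc(u) = sin(pi u) / (pi u), so the unnormalized sinc(z) = sin(z)/z
    used in the text is np.sinc(z / pi).
    """
    diff = np.prod(np.sinc(c * w0 * (x - xb) / np.pi))
    summ = np.prod(np.sinc(c * w0 * (x + xb) / np.pi))
    return c**2 / 6 * (diff - np.exp(-2 * w0**2) * summ) + 1
```

For $\omega = 30$ the $e^{-2\omega^2}$ term underflows to zero, so the kernel value depends only on $x - \bar{x}$, matching the shift-invariance discussed above.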
The general sinc form and the shift-invariance of this kernel can be visualized in Figure 17, along with the effect of varying $\omega$ on the bandwidth of the NNGP kernel. We can see that the NTK of the shallow SIREN, derived below, maintains the same relevant characteristics as the NNGP. We first derive $\dot{\Sigma}$ in the lemma below.

Lemma 10. For $\omega \in \mathbb{R}$, $\dot{\Sigma}^{(1)}(x, \bar{x}) : \mathbb{R}^{n_0} \times \mathbb{R}^{n_0} \to \mathbb{R}$ is given by
$$\dot{\Sigma}^{(1)}(x, \bar{x}) = \frac{c^2}{6} \left( \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j - \bar{x}_j)) + e^{-2\omega^2} \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j + \bar{x}_j)) \right) + 1.$$

Proof. The proof follows the same pattern as Theorem 9, the only difference being a few sign changes after the exponential expansion of the trigonometric functions, due to the different identities for sine and cosine.

Now we can derive the NTK for the shallow SIREN.

Corollary 11. Shallow SIREN NTK. For a single hidden layer SIREN $f^{(1)} : \mathbb{R}^{n_0} \to \mathbb{R}^{n_2}$ following Definition 1, its neural tangent kernel (NTK), as defined in Theorem 6, is given by
$$\Theta^{(1)}(x, \bar{x}) = \left( \omega^2 \left( x^T \bar{x} + 1 \right) \right) \left( \frac{c^2}{6} \left( \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j - \bar{x}_j)) + e^{-2\omega^2} \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j + \bar{x}_j)) \right) + 1 \right)$$
$$+ \frac{c^2}{6} \left( \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j - \bar{x}_j)) - e^{-2\omega^2} \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j + \bar{x}_j)) \right) + 1$$
$$= \frac{c^2}{6} \left( \omega^2 \left( x^T \bar{x} + 1 \right) + 1 \right) \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j - \bar{x}_j)) + \frac{c^2}{6} \left( \omega^2 \left( x^T \bar{x} + 1 \right) - 1 \right) e^{-2\omega^2} \prod_{j=1}^{n_0} \mathrm{sinc}(c \, \omega (x_j + \bar{x}_j)) + \omega^2 \left( x^T \bar{x} + 1 \right) + 1.$$

Proof. Follows trivially by applying Theorem 9 and Lemma 10 to Theorem 6.

Though the expressions become more complex due to the formulation of the NTK, we can see that many of the same properties from the NNGP still apply. Again, for reasonable values of $\omega$, the term with the exponential factor $e^{-2\omega^2}$ will be of negligible relative magnitude. With $c = \sqrt{6}$, this leaves us with
$$\left( \omega^2 \left( x^T \bar{x} + 1 \right) + 1 \right) \prod_{j=1}^{n_0} \mathrm{sinc}\left( \sqrt{6} \, \omega (x_j - \bar{x}_j) \right) + \omega^2 \left( x^T \bar{x} + 1 \right) + 1,$$
which is of the same form as the NNGP, with some additional linear terms $x^T \bar{x}$. Though these linear terms break the pure shift-invariance, we still have a strong diagonal and the sinc form with bandwidth determined by $\omega$, as can be seen in Figure 18.
Similarly to the NNGP, the SIREN NTK suggests that training a shallow SIREN is approximately equivalent to performing kernel regression with a sinc kernel, a low-pass filter, with its bandwidth defined by $\omega$. This agrees intuitively with the experimental observation from the paper that, in order to fit higher-frequency signals, a larger $\omega$ is required.
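The kernel regression view can be illustrated directly. Below is a small hypothetical sketch (bandwidth and regularization values are illustrative) that regresses a band-limited signal with a sinc kernel standing in for the large-$\omega$ shallow SIREN NTK:

```python
import numpy as np

def sinc_kernel_regression(x_train, y_train, x_test, bandwidth, reg=1e-6):
    """Kernel ridge regression with an unnormalized sinc kernel k(a, b) = sinc(bw (a - b)).

    np.sinc is normalized, so sinc(z) = sin(z)/z is np.sinc(z / pi).
    """
    k = lambda a, b_: np.sinc(bandwidth * (a[:, None] - b_[None, :]) / np.pi)
    K = k(x_train, x_train) + reg * np.eye(len(x_train))
    return k(x_test, x_train) @ np.linalg.solve(K, y_train)

# A signal whose frequency lies inside the kernel's pass band.
x_tr = np.linspace(-1, 1, 40)
y_tr = np.sin(4 * np.pi * x_tr)
x_te = np.linspace(-1, 1, 200)
y_te = sinc_kernel_regression(x_tr, y_tr, x_te, bandwidth=40.0)
```

Because the sinc kernel only passes frequencies below its bandwidth, the regression interpolates smoothly between samples rather than introducing spurious high-frequency content, mirroring the low-pass behavior attributed to the NTK.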

D.2.2 SIMPLE SINUSOIDAL NETWORK

Just as we did in the last section, we will now first derive the NNGP for a simple sinusoidal network, and then use it to obtain the NTK as well. As we will see, the Gaussian initialization employed in the SSN has the benefit of rendering the derivations cleaner, while retaining the relevant properties of the SIREN initialization. We observe that a similar derivation of this NNGP (using cosine functions instead of sine) can be found in Pearce et al. (2019), with a focus on a Bayesian perspective of the result.

Theorem 12. Shallow SSN NNGP. For a single hidden layer simple sinusoidal network $f^{(1)} : \mathbb{R}^{n_0} \to \mathbb{R}^{n_2}$ following Definition 1, as the size of the hidden layer $n_1 \to \infty$, $f^{(1)}$ tends (by law of large numbers) to the neural network Gaussian process (NNGP) with covariance
$$\Sigma^{(1)}(x, \bar{x}) = \frac{1}{2} \left( e^{-\frac{\omega^2}{2} \|x - \bar{x}\|_2^2} - e^{-\frac{\omega^2}{2} \|x + \bar{x}\|_2^2} e^{-2\omega^2} \right) + 1.$$

Proof. We again initially follow an approach similar to the one described in Lee et al. (2018). From our sinusoidal network definition, each element $f^{(1)}(x)_j$ in the output vector is a weighted combination of elements in $W^{(1)}$ and $b^{(1)}$. Conditioning on the outputs from the first layer ($L = 0$), since the sine function is bounded and each of the parameters is Gaussian with finite variance and zero mean, the $f^{(1)}(x)_j$ are also normally distributed with mean zero by the CLT. Since any subset of elements in $f^{(1)}(x)$ is jointly Gaussian, this outer layer is described by a Gaussian process.
Therefore, its covariance is determined by $\sigma_W^2 \Sigma^{(1)} + \sigma_b^2 = \Sigma^{(1)} + 1$, where
$$\Sigma^{(1)}(x, \bar{x}) = \lim_{n_1 \to \infty} \frac{1}{n_1} \left\langle \sin\left( f^{(0)}(x) \right), \sin\left( f^{(0)}(\bar{x}) \right) \right\rangle = \lim_{n_1 \to \infty} \frac{1}{n_1} \sum_{j=1}^{n_1} \sin\left( \omega \left( W_j^{(0)} x + b_j^{(0)} \right) \right) \sin\left( \omega \left( W_j^{(0)} \bar{x} + b_j^{(0)} \right) \right).$$
By the LLN, and expanding the exponential definition of sine as in Theorem 9, this converges to
$$-\frac{1}{4} \left( \mathbb{E}\left[ e^{i\omega w^T (x + \bar{x})} \right] \mathbb{E}\left[ e^{2i\omega b} \right] - \mathbb{E}\left[ e^{i\omega w^T (x - \bar{x})} \right] - \mathbb{E}\left[ e^{-i\omega w^T (x - \bar{x})} \right] + \mathbb{E}\left[ e^{-i\omega w^T (x + \bar{x})} \right] \mathbb{E}\left[ e^{-2i\omega b} \right] \right),$$
now with $w \sim \mathcal{N}(0, I_{n_0})$ and $b \sim \mathcal{N}(0, 1)$. Applying Proposition 7 to each expectation above, this becomes
$$-\frac{1}{4} \left( e^{-\frac{\omega^2}{2} \|x + \bar{x}\|_2^2} e^{-2\omega^2} - e^{-\frac{\omega^2}{2} \|x - \bar{x}\|_2^2} - e^{-\frac{\omega^2}{2} \|x - \bar{x}\|_2^2} + e^{-\frac{\omega^2}{2} \|x + \bar{x}\|_2^2} e^{-2\omega^2} \right) = \frac{1}{2} \left( e^{-\frac{\omega^2}{2} \|x - \bar{x}\|_2^2} - e^{-\frac{\omega^2}{2} \|x + \bar{x}\|_2^2} e^{-2\omega^2} \right).$$

We can once again observe that, for practical values of $\omega$, the NNGP simplifies to
$$\frac{1}{2} e^{-\frac{\omega^2}{2} \|x - \bar{x}\|_2^2} + 1.$$
This takes the form of a Gaussian kernel, which is also a low-pass filter, with its bandwidth determined by $\omega$. We note that, similar to the $c = \sqrt{6}$ setting for SIRENs, in practice a scaling factor of $\sqrt{2}$ is applied to the normal activations, as described in Section 3, which cancels out the $1/2$ factors in the kernels, preserving the variance magnitude. Moreover, the kernel is again a function solely of $\Delta x$, in agreement with the shift-invariance that is also observed in simple sinusoidal networks. Visualizations of this NNGP are provided in Figure 19.

We will now proceed to derive the NTK, which requires first obtaining $\dot{\Sigma}$.

Lemma 13. For $\omega \in \mathbb{R}$, $\dot{\Sigma}^{(1)}(x, \bar{x}) : \mathbb{R}^{n_0} \times \mathbb{R}^{n_0} \to \mathbb{R}$ is given by
$$\dot{\Sigma}^{(1)}(x, \bar{x}) = \frac{1}{2} \left( e^{-\frac{\omega^2}{2} \|x - \bar{x}\|_2^2} + e^{-\frac{\omega^2}{2} \|x + \bar{x}\|_2^2} e^{-2\omega^2} \right) + 1.$$

Proof. The proof follows the same pattern as Theorem 12, the only difference being a few sign changes after the exponential expansion of the trigonometric functions, due to the different identities for sine and cosine.

Corollary 14. Shallow SSN NTK.
For a simple sinusoidal network with a single hidden layer $f^{(1)} : \mathbb{R}^{n_0} \to \mathbb{R}^{n_2}$ following Definition 1, its neural tangent kernel (NTK), as defined in Theorem 6, is given by
$$\Theta^{(1)}(x, \bar{x}) = \left( \omega^2 \left( x^T \bar{x} + 1 \right) \right) \left( \frac{1}{2} \left( e^{-\frac{\omega^2}{2} \|x - \bar{x}\|_2^2} + e^{-\frac{\omega^2}{2} \|x + \bar{x}\|_2^2} e^{-2\omega^2} \right) + 1 \right) + \frac{1}{2} \left( e^{-\frac{\omega^2}{2} \|x - \bar{x}\|_2^2} - e^{-\frac{\omega^2}{2} \|x + \bar{x}\|_2^2} e^{-2\omega^2} \right) + 1$$
$$= \frac{1}{2} \left( \omega^2 \left( x^T \bar{x} + 1 \right) + 1 \right) e^{-\frac{\omega^2}{2} \|x - \bar{x}\|_2^2} + \frac{1}{2} \left( \omega^2 \left( x^T \bar{x} + 1 \right) - 1 \right) e^{-\frac{\omega^2}{2} \|x + \bar{x}\|_2^2} e^{-2\omega^2} + \omega^2 \left( x^T \bar{x} + 1 \right) + 1.$$

Proof. Follows trivially by applying Theorem 12 and Lemma 13 to Theorem 6.

We again note the vanishing factor $e^{-2\omega^2}$, which leaves us with
$$\frac{1}{2} \left( \omega^2 \left( x^T \bar{x} + 1 \right) + 1 \right) e^{-\frac{\omega^2}{2} \|x - \bar{x}\|_2^2} + \omega^2 \left( x^T \bar{x} + 1 \right) + 1.$$
As with the SIREN before, this NTK is of the same form as its corresponding NNGP. While we again have additional linear terms $x^T \bar{x}$ compared to the NNGP, here too the kernel preserves its strong diagonal. It is still close to a Gaussian kernel, with its bandwidth determined directly by $\omega$. We demonstrate this in Figure 20, where the NTK for different values of $\omega$ is shown. Additionally, we also plot a pure Gaussian kernel with variance $\omega^2$, scaled to match the maximum and minimum values of the NTK; we can observe that the NTK closely matches the Gaussian. Moreover, at $x = 0$ the maximum value is predicted by $k \approx \omega^2 / 2$, as expected from the scaling factors in the kernel in Equation 5. This NTK suggests that training a simple sinusoidal network is approximately equivalent to performing kernel regression with a Gaussian kernel, a low-pass filter, with its bandwidth defined by $\omega$. We note that even though this sinusoidal network kernel approximates a Gaussian kernel, an actual Gaussian kernel can be recovered if a combination of sine and cosine activations is employed, as demonstrated in Tsuchida (2020) (Proposition 18).
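The shallow SSN NTK is simple enough to evaluate directly. The sketch below (our own helper) builds it from its ingredients in Theorem 6 (the base kernel $\Theta^{(0)}$, $\Sigma^{(1)}$ from Theorem 12, and $\dot{\Sigma}^{(1)}$ from Lemma 13), and evaluates a slice at fixed $x = 0$ as in the figures:

```python
import numpy as np

def ssn_shallow_ntk(x, xb, w0):
    """Shallow SSN NTK via Theta^(1) = Theta^(0) * Sigma_dot^(1) + Sigma^(1)."""
    x, xb = np.atleast_1d(x).astype(float), np.atleast_1d(xb).astype(float)
    theta0 = w0**2 * (x @ xb + 1)
    A = np.exp(-w0**2 / 2 * np.sum((x - xb) ** 2))                      # "difference" Gaussian
    B = np.exp(-w0**2 / 2 * np.sum((x + xb) ** 2)) * np.exp(-2 * w0**2)  # vanishing term
    sigma = 0.5 * (A - B) + 1
    sigma_dot = 0.5 * (A + B) + 1
    return theta0 * sigma_dot + sigma

# Slice at fixed x = 0, varying the second argument, as in the kernel plots.
w0 = 8.0
grid = np.linspace(-0.5, 0.5, 101)
ntk = np.array([ssn_shallow_ntk(0.0, xb, w0) for xb in grid])
```

The slice peaks at $\bar{x} = 0$ and decays like a Gaussian in $x - \bar{x}$, with its width controlled by $\omega$, consistent with the low-pass interpretation above.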

D.3 DEEP SINUSOIDAL NETWORKS

We will now look at the full NNGP and NTK for sinusoidal networks of arbitrary depth. As we will see, due to the recursive nature of these kernels, for networks deeper than the ones analyzed in the previous section their full unrolled expressions quickly become intractable, especially for the NTK. Nevertheless, these kernels can still provide some insight into the behavior of their corresponding networks. Moreover, despite their symbolic complexity, we will also demonstrate empirically that the resulting kernels can be approximated by simple Gaussian kernels, even for deep networks.

D.3.1 SIMPLE SINUSOIDAL NETWORK

As demonstrated in the previous section, simple sinusoidal networks produce simpler NNGP and NTK kernels due to their Gaussian initialization. We thus begin this section by analyzing SSNs first, starting with their general NNGP.

Theorem 15. SSN NNGP. For a simple sinusoidal network with $L$ hidden layers $f^{(L)} : \mathbb{R}^{n_0} \to \mathbb{R}^{n_{L+1}}$ following Definition 1, as the size of the hidden layers $n_1, \dots, n_L \to \infty$ sequentially, $f^{(L)}$ tends (by law of large numbers) to the neural network Gaussian process (NNGP) with covariance $\Sigma^{(L)}(x, \bar{x})$, recursively defined as
$$\Sigma^{(0)}(x, \bar{x}) = \omega^2 \left( x^T \bar{x} + 1 \right),$$
$$\Sigma^{(L)}(x, \bar{x}) = \frac{1}{2} e^{-\frac{1}{2} \left( \Sigma^{(L-1)}(x, x) + \Sigma^{(L-1)}(\bar{x}, \bar{x}) \right)} \left( e^{\Sigma^{(L-1)}(x, \bar{x})} - e^{-\Sigma^{(L-1)}(x, \bar{x})} \right) + 1.$$

Proof. We proceed by induction on the depth $L$, demonstrating the NNGP for successive layers as $n_1, \dots, n_L \to \infty$ sequentially. To demonstrate the base case $L = 1$, let us rearrange $\Sigma^{(1)}$ from Theorem 12 in order to express it in terms of inner products,
$$\Sigma^{(1)}(x, \bar{x}) = \frac{1}{2} \left( e^{-\frac{\omega^2}{2} \|x - \bar{x}\|_2^2} - e^{-\frac{\omega^2}{2} \|x + \bar{x}\|_2^2} e^{-2\omega^2} \right) + 1$$
$$= \frac{1}{2} \left( e^{-\frac{\omega^2}{2} \left( x^T x - 2 x^T \bar{x} + \bar{x}^T \bar{x} \right)} - e^{-\frac{\omega^2}{2} \left( x^T x + 2 x^T \bar{x} + \bar{x}^T \bar{x} \right)} e^{-2\omega^2} \right) + 1$$
$$= \frac{1}{2} \left( e^{-\frac{1}{2} \left[ \omega^2 \left( x^T x + 1 \right) + \omega^2 \left( \bar{x}^T \bar{x} + 1 \right) \right] + \omega^2 \left( x^T \bar{x} + 1 \right)} - e^{-\frac{1}{2} \left[ \omega^2 \left( x^T x + 1 \right) + \omega^2 \left( \bar{x}^T \bar{x} + 1 \right) \right] - \omega^2 \left( x^T \bar{x} + 1 \right)} \right) + 1.$$
Given the definition of $\Sigma^{(0)}$, this is equivalent to
$$\frac{1}{2} e^{-\frac{1}{2} \left( \Sigma^{(0)}(x, x) + \Sigma^{(0)}(\bar{x}, \bar{x}) \right)} \left( e^{\Sigma^{(0)}(x, \bar{x})} - e^{-\Sigma^{(0)}(x, \bar{x})} \right) + 1,$$
which concludes this case. Now, given the inductive hypothesis, as $n_1, \dots, n_{L-1} \to \infty$ we have that the first $L - 1$ layers define a network $f^{(L-1)}$ with NNGP given by $\Sigma^{(L-1)}(x, \bar{x})$. It is left to show that as $n_L \to \infty$, we get the NNGP given by $\Sigma^{(L)}$. Following the same argument as in Theorem 12, the network
$$f^{(L)}(x) = W^{(L)} \frac{1}{\sqrt{n_L}} \sin\left( f^{(L-1)} \right) + b^{(L)}$$
constitutes a Gaussian process given the outputs of the previous layer, due to the distributions of $W^{(L)}$ and $b^{(L)}$.
Its covariance is given by $\sigma_W^2 \Sigma^{(L)} + \sigma_b^2 = \Sigma^{(L)} + 1$, with
$$\Sigma^{(L)}(x, \bar{x}) = \lim_{n_L \to \infty} \frac{1}{n_L} \left\langle \sin\left( f^{(L-1)}(x) \right), \sin\left( f^{(L-1)}(\bar{x}) \right) \right\rangle = \lim_{n_L \to \infty} \frac{1}{n_L} \sum_{j=1}^{n_L} \sin\left( f^{(L-1)}(x)_j \right) \sin\left( f^{(L-1)}(\bar{x})_j \right).$$
By the inductive hypothesis, $f^{(L-1)}$ is a Gaussian process with covariance $\Sigma^{(L-1)}(x, \bar{x})$. Thus by the LLN the limit above equals
$$\mathbb{E}_{(u, v) \sim \mathcal{N}(0, \Sigma^{(L-1)}(x, \bar{x}))}\left[ \sin(u) \sin(v) \right].$$
Omitting the distribution from the expectation for brevity and expanding the exponential definition of sine, we have
$$\mathbb{E}\left[ \frac{1}{2i} \left( e^{iu} - e^{-iu} \right) \frac{1}{2i} \left( e^{iv} - e^{-iv} \right) \right] = -\frac{1}{4} \left( \mathbb{E}\left[ e^{i(u+v)} \right] - \mathbb{E}\left[ e^{i(u-v)} \right] - \mathbb{E}\left[ e^{-i(u-v)} \right] + \mathbb{E}\left[ e^{-i(u+v)} \right] \right).$$
Since $u$ and $v$ are jointly Gaussian, $p = u + v$ and $m = u - v$ are also Gaussian, with mean 0 and variances
$$\sigma_p^2 = \sigma_u^2 + \sigma_v^2 + 2\,\mathrm{Cov}[u, v] = \Sigma^{(L-1)}(x, x) + \Sigma^{(L-1)}(\bar{x}, \bar{x}) + 2\,\Sigma^{(L-1)}(x, \bar{x}),$$
$$\sigma_m^2 = \sigma_u^2 + \sigma_v^2 - 2\,\mathrm{Cov}[u, v] = \Sigma^{(L-1)}(x, x) + \Sigma^{(L-1)}(\bar{x}, \bar{x}) - 2\,\Sigma^{(L-1)}(x, \bar{x}).$$
We can now rewrite the expectations in terms of normalized variables,
$$-\frac{1}{4} \left( \mathbb{E}_{z \sim \mathcal{N}(0,1)}\left[ e^{i \sigma_p z} \right] - \mathbb{E}_{z \sim \mathcal{N}(0,1)}\left[ e^{i \sigma_m z} \right] - \mathbb{E}_{z \sim \mathcal{N}(0,1)}\left[ e^{-i \sigma_m z} \right] + \mathbb{E}_{z \sim \mathcal{N}(0,1)}\left[ e^{-i \sigma_p z} \right] \right).$$
Applying Proposition 7 to each expectation, we get
$$\frac{1}{2} \left( e^{-\frac{1}{2} \sigma_m^2} - e^{-\frac{1}{2} \sigma_p^2} \right) = \frac{1}{2} \left( e^{-\frac{1}{2} \left( \Sigma^{(L-1)}(x, x) + \Sigma^{(L-1)}(\bar{x}, \bar{x}) - 2\,\Sigma^{(L-1)}(x, \bar{x}) \right)} - e^{-\frac{1}{2} \left( \Sigma^{(L-1)}(x, x) + \Sigma^{(L-1)}(\bar{x}, \bar{x}) + 2\,\Sigma^{(L-1)}(x, \bar{x}) \right)} \right)$$
$$= \frac{1}{2} e^{-\frac{1}{2} \left( \Sigma^{(L-1)}(x, x) + \Sigma^{(L-1)}(\bar{x}, \bar{x}) \right)} \left( e^{\Sigma^{(L-1)}(x, \bar{x})} - e^{-\Sigma^{(L-1)}(x, \bar{x})} \right).$$

Unrolling the definition beyond $L = 1$ leads to expressions that are difficult to parse. However, without unrolling, we can rearrange the terms in the NNGP above as
$$\Sigma^{(L)}(x, \bar{x}) = \frac{1}{2} e^{-\frac{1}{2} \left( \Sigma^{(L-1)}(x, x) + \Sigma^{(L-1)}(\bar{x}, \bar{x}) \right)} \left( e^{\Sigma^{(L-1)}(x, \bar{x})} - e^{-\Sigma^{(L-1)}(x, \bar{x})} \right) + 1$$
$$= \frac{1}{2} \left( e^{-\frac{1}{2} \left( \Sigma^{(L-1)}(x, x) - 2 \Sigma^{(L-1)}(x, \bar{x}) + \Sigma^{(L-1)}(\bar{x}, \bar{x}) \right)} - e^{-\frac{1}{2} \left( \Sigma^{(L-1)}(x, x) + 2 \Sigma^{(L-1)}(x, \bar{x}) + \Sigma^{(L-1)}(\bar{x}, \bar{x}) \right)} \right) + 1.$$
Since the covariance matrix $\Sigma^{(L-1)}$ is positive semi-definite, the exponents can be reformulated into quadratic forms analogous to the ones in Theorem 12. We can thus observe that essentially the same structure is preserved through the composition of layers, except for the $\omega$ factor present in the first layer.
Moreover, given this recursive definition, since the NNGP at any given depth $L$ is a function only of the preceding kernels, the resulting kernel is also shift-invariant. Let us now derive the $\dot{\Sigma}$ kernel, required for the NTK.

Lemma 16. For $\omega \in \mathbb{R}$, $\dot{\Sigma}^{(L)}(x, \bar{x}) : \mathbb{R}^{n_0} \times \mathbb{R}^{n_0} \to \mathbb{R}$ is given by
$$\dot{\Sigma}^{(L)}(x, \bar{x}) = \frac{1}{2} e^{-\frac{1}{2} \left( \Sigma^{(L-1)}(x, x) + \Sigma^{(L-1)}(\bar{x}, \bar{x}) \right)} \left( e^{\Sigma^{(L-1)}(x, \bar{x})} + e^{-\Sigma^{(L-1)}(x, \bar{x})} \right) + 1.$$

Proof. The proof follows the same pattern as Theorem 15, the only difference being a few sign changes after the exponential expansion of the trigonometric functions, due to the different identities for sine and cosine.

As in the previous section, it would be simple to derive the full NTK for a simple sinusoidal network of arbitrary depth by applying Theorem 6 with the NNGP kernels above. However, there is not much to be gained from writing the convoluted NTK expression explicitly, beyond what we have already gleaned from the NNGP. Nevertheless, some insight can be gained from the recursive expression of the NTK itself, as defined in Theorem 6. First, note that, as before, for practical values of $\omega$ we have $\dot{\Sigma} \approx \Sigma$, both converging to simply a single Gaussian kernel. Thus, our NTK recursion becomes
$$\Theta^{(L)}(x, \bar{x}) \approx \left( \Theta^{(L-1)}(x, \bar{x}) + 1 \right) \Sigma^{(L)}(x, \bar{x}).$$
When expanded, the NTK is thus essentially a product of the Gaussian $\Sigma$ kernels,
$$\Theta^{(L)}(x, \bar{x}) \approx \left( \cdots \left( \left( \Sigma^{(0)}(x, \bar{x}) + 1 \right) \Sigma^{(1)}(x, \bar{x}) + 1 \right) \cdots \Sigma^{(L-1)}(x, \bar{x}) + 1 \right) \Sigma^{(L)}(x, \bar{x})$$
$$= \left( \cdots \left( \left( \omega^2 \left( x^T \bar{x} + 1 \right) + 1 \right) \Sigma^{(1)}(x, \bar{x}) + 1 \right) \cdots \Sigma^{(L-1)}(x, \bar{x}) + 1 \right) \Sigma^{(L)}(x, \bar{x}). \qquad (6)$$
Since the product of two Gaussian kernels is Gaussian, the general form of the kernel should be approximately a sum of Gaussian kernels. As long as the magnitude of one of the terms dominates the sum, the overall resulting kernel will be approximately Gaussian. Empirically, we observe this to be the case, with the inner term containing $\omega^2$ dominating the sum for reasonable values (e.g., $\omega > 1$ and $L < 10$).
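The recursion in Theorem 15 is straightforward to evaluate numerically. Below is a hedged sketch (the helper is ours) that tracks the three entries of the $2 \times 2$ covariance through the layers and, as a sanity check, can be compared at $L = 1$ against the closed form of Theorem 12:

```python
import numpy as np

def ssn_nngp(x, xb, w0, L):
    """Recursive SSN NNGP (Theorem 15): returns Sigma^(L)(x, xb)."""
    x, xb = np.atleast_1d(x).astype(float), np.atleast_1d(xb).astype(float)
    # Track k(x,x), k(xb,xb), k(x,xb), starting from Sigma^(0) = w0^2 (x^T xb + 1).
    kxx = w0**2 * (x @ x + 1)
    kbb = w0**2 * (xb @ xb + 1)
    kxb = w0**2 * (x @ xb + 1)
    for _ in range(L):
        new_xxb = 0.5 * np.exp(-0.5 * (kxx + kbb)) * (np.exp(kxb) - np.exp(-kxb)) + 1
        new_xx = 0.5 * np.exp(-kxx) * (np.exp(kxx) - np.exp(-kxx)) + 1
        new_bb = 0.5 * np.exp(-kbb) * (np.exp(kbb) - np.exp(-kbb)) + 1
        kxx, kbb, kxb = new_xx, new_bb, new_xxb
    return kxb
```

Note that the diagonal entries stay bounded through the recursion (each maps into $(1, 1.5]$ after the first layer), so deep evaluations remain numerically stable for moderate $\omega$.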
In Figure 21, we show the NTK for networks of varying depth and $\omega$, together with a pure Gaussian kernel of variance $\omega^2$, scaled to match the maximum and minimum values of the NTK. We can observe that the NTKs are still approximately Gaussian, with their maximum value approximated by $k \approx \frac{1}{2^L} \omega^2$, as expected from the product of $\omega^2$ and the $L$ kernels above. We also observe that the width of the kernels is mainly defined by $\omega$.

Since in this case we have a forward problem, we do not have any prior information on which to base our choice of $\omega$, besides an upper limit given by the Nyquist frequency implied by the sampling of our training data. We thus follow the usual machine learning procedure and experiment with a number of small $\omega$ values, based on the previous experiments. We find that $\omega = 4$ gives the best results, with a solution MSE of $4.30 \cdot 10^{-4}$, against an MSE of $1.04 \cdot 10^{-3}$ for the baseline.

Training details. We follow the same training procedures as in Raissi et al. (2019a). The training set is created by randomly sampling 20,000 points from the domain ($x \in [-5, 5]$, $t \in [0, \pi/2]$) for evaluation of the physics-informed loss. Additionally, 50 points are sampled from each of the boundary and initial conditions for direct data supervision. The neural networks used are 5-layer MLPs with 100 neurons per hidden layer. The network structure is the same for both the tanh and sinusoidal networks. As in the original work, the network is first trained with the Adam optimizer for 50,000 steps, and then with L-BFGS until convergence. The loss is the sum of an MSE loss over the data points and a physics-informed MSE loss derived from Equation 10.



Figure 1: The NTK for SIREN and SSN at different ω. Top: Kernel values for pairs (x, x) ∈ [-1, 1] 2 . Bottom: Slice at fixed x = 0. SSN plots show a superimposed Gaussian kernel with variance ω -2 scaled to match the max and min values of the NTK. Similarly, SIREN plots show a sinc function.

Figure 2: Left: The test signal used to analyze the behavior of sinusoidal networks. It is created from two orthogonal single frequencies, f (x, y) = cos(128πx) + cos(32πy). Right: Examples of the reconstructed signal from networks with different ω, demonstrating each of the loss levels in Figure 3.

Figure 4: Left: Final training loss for different values of ω in the image fitting generalization experiment. Right: Examples of generalization from half the points using sinusoidal networks with different values of ω. Even though both networks achieve equivalent training loss, the rightmost one, with ω higher than what would be suggested from the Nyquist frequency of the input signal, overfits the data, causing high-frequency noise artifacts in the reconstruction (e.g., notice the sky).

Figure 5: Left: Reconstruction loss for different cutoff frequencies for a low-pass filter applied to the solution of the Burgers equation. Right: Reconstructed solution of the Burgers equation using the identified parameters with the sinusoidal network, together with the position of the sampled training points.

Figure 6: Left: One timestep of the ground truth Navier-Stokes solution. The black rectangle indicates the domain region used for the task. Right: Identified pressure field for the Navier-Stokes equations using the sinusoidal network. Notice that the identification is only possible up to a constant.

Given any $c$, for $X \sim \mathcal{N}\left(0, \frac{1}{3} c^2\right)$ and $Y \sim \mathcal{U}(-c, c)$, we have $\mathrm{Var}[X] = \mathrm{Var}[Y] = \frac{1}{3} c^2$.

Figure 7: Top row: Ground truth image. Bottom: Reconstructed with sinusoidal network.

Figure 9: Activations for 6 layers of a simplified sinusoidal network in which the input has been shifted by a large value, i.e., $x \to x + 1000$. The distribution characteristics are preserved, demonstrating the sinusoidal network's shift-invariance.

Figure 11: Top row: Frames from ground truth "cat" video. Bottom: Video reconstructed from sinusoidal network.

Figure 12: Top row: Frames from ground truth "bikes" video. Bottom: Video reconstructed from sinusoidal network.

Figures 11 and 12 show sampled frames from the videos used in this experiment, and their respective reconstructions from the fitted sinusoidal networks.

Figure 14: Top row: Ground truth real and imaginary fields. Bottom: Reconstructed with sinusoidal network.

SIGNED DISTANCE FUNCTION (SDF)

In these tasks we learn a 3D signed distance function. We learn a function $f : \mathbb{R}^3 \to \mathbb{R}$, parametrized as a sinusoidal network, to model a signed distance function representing a 3D scene. This function is supervised indirectly from point cloud data of the scene. Figures 16 and 15 show 3D renderings of the volumes inferred from the learned SDFs.

Training parameters. The statue point cloud contains 4,999,996 points. The room point cloud contains 10,250,688 points. These signals are fitted from the input domain $[-1, 1]^3$. The sinusoidal network used is a 5-layer MLP with hidden size 256 for the statue and 1024 for the room. The parameter $\omega$ is set to 4. The Adam optimizer is used, with a learning rate of $8 \cdot 10^{-4}$ and a batch size of 1400. The models are trained for 190,000 steps for the statue experiment and for 410,000 steps for the room experiment.

Figure 15: Rendering of the "room" 3D scene SDF learned by the sinusoidal network from a point cloud.

Figure 16: Rendering of the "statue" 3D scene SDF learned by the sinusoidal network from a point cloud.

Proposition 7. For any $\omega \in \mathbb{R}$, $x \in \mathbb{R}^d$,
$$\mathbb{E}_{w \sim \mathcal{N}(0, I_d)}\left[ e^{i \omega \left( w^T x \right)} \right] = e^{-\frac{\omega^2}{2} \|x\|_2^2}.$$
Proof. Omitting $w \sim \mathcal{N}(0, I_d)$ from the expectation for brevity, we have $\mathbb{E}\left[ e^{i \omega (w^T x)} \right] = \mathbb{E}\left[ e^{i \omega \sum_{j=1}^{d} w_j x_j} \right]$. By independence of the components of $w$ and the definition of expectation,
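Proposition 7 is the Gaussian characteristic function, and can be sanity-checked by Monte Carlo (an illustrative check, not part of the proof; the sample size and tolerance are arbitrary):

```python
import numpy as np

# Monte Carlo check of Proposition 7: E_{w~N(0,I_d)}[exp(i w^T (omega x))] = exp(-omega^2 ||x||^2 / 2).
rng = np.random.default_rng(0)
d, omega = 3, 1.5
x = np.array([0.3, -0.2, 0.1])
w = rng.standard_normal((200_000, d))
mc = np.mean(np.exp(1j * omega * (w @ x)))      # empirical expectation, complex
exact = np.exp(-omega**2 * (x @ x) / 2)         # closed form from Proposition 7
```

The imaginary part should vanish (the distribution of $w$ is symmetric), and the real part should match the closed form up to sampling noise.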

Proposition 8. For any $c, \omega \in \mathbb{R}$, $x \in \mathbb{R}^d$,
$$\mathbb{E}_{w \sim \mathcal{U}^d(-c, c)}\left[ e^{i \omega \left( w^T x \right)} \right] = \prod_{j=1}^{d} \mathrm{sinc}(c \, \omega x_j).$$
Proof. Omitting $w \sim \mathcal{U}^d(-c, c)$ from the expectation for brevity, we have $\mathbb{E}\left[ e^{i \omega (w^T x)} \right] = \mathbb{E}\left[ e^{i \omega \sum_{j=1}^{d} w_j x_j} \right]$. By independence of the components of $w$ and the definition of expectation,

$$\prod_{j=1}^{d} \frac{\sin(c \, \omega x_j)}{c \, \omega x_j} = \prod_{j=1}^{d} \mathrm{sinc}(c \, \omega x_j).$$

Figure 17: The NNGP for SIREN at different ω values. The top row shows the kernel values for pairs (x, x) ∈ [-1, 1] 2 . Bottom row shows a slice at fixed x = 0.

Figure 18: The NTK for SIREN at different ω values. The top row shows the kernel values for pairs (x, x) ∈ [-1, 1] 2 . Bottom row shows a slice at fixed x = 0.

Now by the LLN the limit above converges to
$$\mathbb{E}_{w \sim \mathcal{N}(0, I_{n_0}), \, b \sim \mathcal{N}(0, 1)}\left[ \sin\left( \omega \left( w^T x + b \right) \right) \sin\left( \omega \left( w^T \bar{x} + b \right) \right) \right],$$
where $w \in \mathbb{R}^{n_0}$ and $b \in \mathbb{R}$. Omitting the distributions from the expectation for brevity and expanding the exponential definition of sine, we have
$$\mathbb{E}\left[ \frac{1}{2i} \left( e^{i\omega(w^T x + b)} - e^{-i\omega(w^T x + b)} \right) \frac{1}{2i} \left( e^{i\omega(w^T \bar{x} + b)} - e^{-i\omega(w^T \bar{x} + b)} \right) \right]$$
$$= -\frac{1}{4} \mathbb{E}\left[ e^{i\omega(w^T x + b) + i\omega(w^T \bar{x} + b)} - e^{i\omega(w^T x + b) - i\omega(w^T \bar{x} + b)} - e^{-i\omega(w^T x + b) + i\omega(w^T \bar{x} + b)} + e^{-i\omega(w^T x + b) - i\omega(w^T \bar{x} + b)} \right].$$

Figure 19: The NNGP for SSN at different ω values. The top row shows the kernel values for pairs (x, x) ∈ [-1, 1] 2 . Bottom row shows a slice at fixed x = 0.

Figure 20: The NTK for SSN at different ω values. The top row shows the kernel values for pairs (x, x) ∈ [-1, 1] 2 . Bottom row shows a slice at fixed x = 0, together with a Gaussian kernel scaled to match the maximum and minimum values of the NTK.

Figure 21: The NTK for SSN at different ω and network depth (L) values. Kernel values at a slice for fixed x = 0 are shown, together with a Gaussian kernel scaled to match the maximum and minimum values of the NTK.

Figure 25 shows the solution from the sinusoidal network, together with the position of the sampled data points used for training.

Figure 25: Solution to the Schrödinger equation with the sinusoidal network, together with the position of the sampled data points used for training.

Comparison of the simple sinusoidal network and SIREN results, both directly from Sitzmann et al. (2020) and from our own reproduced experiments. Values above the horizontal center line are peak signal to noise ratio (PSNR), values below are mean squared error (MSE), except for SDF which uses a composite loss. † Audio experiments utilized a separate learning rate for the first layer.

Generalization results and the respective tuned ω value. Generalization values are mean squared error (MSE). We can observe that the best-performing ω for generalization is half the ω used previously for fitting the full signal, because this task uses half as many sample points.

Comparison of the sinusoidal network and MLP with tanh non-linearity on PINN experiments from Raissi et al. (2019a).

Comparison of the simple sinusoidal network and SIREN on some experiments, with a longer training duration. The specific durations are described below in the details for each experiment. We can see that the simple sinusoidal network has stronger asymptotic performance. Values above the horizontal center line are peak signal-to-noise ratio (PSNR); values below are mean squared error (MSE). † Audio experiments utilized a different learning rate for the first layer; see the full description below for details.


pass filters. Due to the SIREN initialization, its NNGP and NTK were previously shown to have more complex expressions. However, we will show in this section that the sinc kernel that arises from the shallow SIREN is gradually "dampened" as the depth of the network increases, progressively approximating a Gaussian kernel.

Theorem 17 (SIREN NNGP). For a SIREN with L hidden layers f^(L) : R^{n_0} → R^{n_{L+1}} following Definition 1, as the size of the hidden layers n_1, . . . , n_L → ∞ sequentially, f^(L) tends (by the law of large numbers) to the neural network Gaussian process (NNGP) with covariance Σ^(L)(x, x̄), defined recursively with base case Σ^(1) given by Theorem 9 and, beyond the first layer, the same recursion as for the simple sinusoidal network.

Proof. Intuitively, after the first hidden layer, the inputs to every subsequent hidden layer are of infinite width, due to the NNGP assumptions. Therefore, by the CLT, the pre-activation values at every layer are Gaussian, and the NNGP is unaffected by the uniform weight initialization (compared to the Gaussian weight initialization case). The only layer for which this is not the case is the first, since the input size is fixed and finite. This gives rise to the different Σ^(1).

Formally, the proof proceeds by induction on the depth L, demonstrating the NNGP for successive layers as n_1, . . . , n_L → ∞ sequentially. The base case comes directly from Theorem 9; beyond it, the proof proceeds exactly as in Theorem 15.

For the same reasons as in the proof above, the Σ kernels after the first layer are also equal to those of the simple sinusoidal network, given in Lemma 16. Given this similarity of the kernels beyond the first layer, the interpretation of this NNGP is the same as discussed in the previous section for the simple sinusoidal network.

Analogously to the SSN case before, the SIREN NTK expansion can also be approximated as a product of Σ kernels, as in Equation 6.
The product of a sinc function with L − 1 subsequent Gaussians "dampens" the sinc, such that as the network depth increases the NTK approaches a Gaussian, as can be seen in Figure 22.
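This dampening is easy to verify numerically. The sketch below (numpy; the choices of ω, σ, and the grid are illustrative, not the paper's) multiplies a sinc by increasing numbers of unit Gaussians and measures the height of the largest side lobe relative to the peak, which shrinks monotonically with depth:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 2001)
omega, sigma = 6.0, 1.0
gauss = np.exp(-x**2 / (2 * sigma**2))

def sidelobe_ratio(k):
    # Largest magnitude outside the main lobe of sinc(omega x), relative to the peak.
    main_lobe = np.abs(x) < np.pi / omega  # first zero of sin(omega x)/(omega x)
    return np.max(np.abs(k[~main_lobe])) / np.max(np.abs(k))

# np.sinc(t) = sin(pi t)/(pi t), so np.sinc(omega * x / np.pi) = sin(omega x)/(omega x).
ratios = [sidelobe_ratio(np.sinc(omega * x / np.pi) * gauss ** (L - 1))
          for L in (1, 2, 4, 8)]
# ratios decreases monotonically: deeper products suppress the sinc side lobes.
```

Since the Gaussian factor is strictly below 1 everywhere outside the origin, each additional factor shrinks the side lobes while leaving the peak untouched, which is the mechanism behind the convergence seen in Figure 22.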

E EXPERIMENTAL DETAILS

E.1 GENERALIZATION

Where not explicitly commented, details for the generalization experiments are the same for the comparisons to SIREN, described in Appendix C.

E.2 BURGERS EQUATION (IDENTIFICATION)

We follow the same training procedures as in Raissi et al. (2019a). The training set is created by randomly sampling 2,000 points from the available exact solution grid (shown in Figure 5). The neural networks used are 9-layer MLPs with 20 neurons per hidden layer. The network structure is the same for both the tanh and sinusoidal networks. As in the original work, the network is trained by using L-BFGS to minimize a loss composed of the sum of an MSE loss over the data points and a physics-informed MSE loss derived from the Burgers equation
$$u_t + \lambda_1 u u_x - \lambda_2 u_{xx} = 0.$$
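The physics-informed part of such a loss penalizes the PDE residual at sampled collocation points, with the derivatives obtained by automatic differentiation. A minimal sketch of the residual computation in JAX, where the hypothetical closed-form field u stands in for the trained network (any differentiable map works the same way):

```python
import jax
import jax.numpy as jnp

# Hypothetical stand-in for the network output u(t, x).
def u(t, x):
    return jnp.exp(-t) * jnp.sin(x)

def burgers_residual(t, x, lam1, lam2):
    # Burgers residual f = u_t + lam1 * u * u_x - lam2 * u_xx, via autodiff.
    u_t = jax.grad(u, argnums=0)(t, x)
    u_x = jax.grad(u, argnums=1)(t, x)
    u_xx = jax.grad(jax.grad(u, argnums=1), argnums=1)(t, x)
    return u_t + lam1 * u(t, x) * u_x - lam2 * u_xx

# The physics-informed MSE is the mean of burgers_residual(...)**2 over the
# collocation points, added to the data-fitting MSE before the L-BFGS step.
```

In the identification setting, lam1 and lam2 would be trainable scalars optimized jointly with the network weights.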

E.3 NAVIER-STOKES (IDENTIFICATION)

We follow the same training procedures as in Raissi et al. (2019a). The training set is created by randomly sampling 5,000 points from the available exact solution grid (one timestep is shown in Figure 6). The neural networks used are 9-layer MLPs with 20 neurons per hidden layer. The network structure is the same for both the tanh and sinusoidal networks. As in the original work, the network is trained by using the Adam optimizer to minimize a loss composed of the sum of an MSE loss over the data points and a physics-informed MSE loss derived from the equations
$$u_t + \lambda_1 (u u_x + v u_y) = -p_x + \lambda_2 (u_{xx} + u_{yy}) \quad (8)$$
$$v_t + \lambda_1 (u v_x + v v_y) = -p_y + \lambda_2 (v_{xx} + v_{yy}). \quad (9)$$
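One useful observation about this identification task is that λ1 and λ2 enter equations (8)-(9) linearly, so given exact derivative fields they would be recoverable by ordinary least squares; in the PINN they are instead learned jointly with the network by Adam. A toy numpy sketch of this linear structure, with synthetic arrays standing in for the derivative fields (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
lam1_true, lam2_true = 1.0, 0.01

# Synthetic stand-ins for the fields in u_t + lam1 (u u_x + v u_y) = -p_x + lam2 (u_xx + u_yy).
conv = rng.standard_normal(500)   # plays the role of (u u_x + v u_y)
diff = rng.standard_normal(500)   # plays the role of (u_xx + u_yy)
p_x = rng.standard_normal(500)
u_t = -lam1_true * conv - p_x + lam2_true * diff  # consistent synthetic data

# Rearranged: u_t + p_x = -lam1 * conv + lam2 * diff, linear in (lam1, lam2).
A = np.stack([-conv, diff], axis=1)
lam1_est, lam2_est = np.linalg.lstsq(A, u_t + p_x, rcond=None)[0]
```

With noisy, network-estimated derivatives this least-squares view no longer applies directly, which is why the PINN formulation optimizes the parameters and the field together.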

E.4 SCHRÖDINGER (INFERENCE)

This experiment reproduces the Schrödinger equation experiment from Raissi et al. (2019a). In this experiment, we are trying to find the solution to the Schrödinger equation, given by
$$i h_t + 0.5 h_{xx} + |h|^2 h = 0. \quad (10)$$
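Since h is complex-valued, one standard approach (the one used in Raissi et al. (2019a)) is to model its real and imaginary parts (u, v) with real-valued network outputs, under which equation (10) splits into two real residuals. A minimal JAX sketch of this split, with hypothetical closed-form fields in place of the network outputs:

```python
import jax
import jax.numpy as jnp

# Hypothetical stand-ins for the network outputs, with h = u + i v.
def u(t, x):
    return t * x

def v(t, x):
    return x ** 2

def schrodinger_residuals(t, x):
    # Splitting i h_t + 0.5 h_xx + |h|^2 h = 0 into real and imaginary parts:
    #   real: -v_t + 0.5 u_xx + (u^2 + v^2) u
    #   imag:  u_t + 0.5 v_xx + (u^2 + v^2) v
    u_t = jax.grad(u, argnums=0)(t, x)
    v_t = jax.grad(v, argnums=0)(t, x)
    u_xx = jax.grad(jax.grad(u, argnums=1), argnums=1)(t, x)
    v_xx = jax.grad(jax.grad(v, argnums=1), argnums=1)(t, x)
    sq = u(t, x) ** 2 + v(t, x) ** 2
    f_re = -v_t + 0.5 * u_xx + sq * u(t, x)
    f_im = u_t + 0.5 * v_xx + sq * v(t, x)
    return f_re, f_im
```

Both residuals are then penalized in the physics-informed MSE, alongside the data and boundary-condition losses.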

