SIMPLE INITIALIZATION AND PARAMETRIZATION OF SINUSOIDAL NETWORKS VIA THEIR KERNEL BANDWIDTH

Abstract

Neural networks with sinusoidal activations have been proposed as an alternative to networks with traditional activation functions. Despite their promise, particularly for learning implicit models, their training behavior is not yet fully understood, leading to a number of empirical design choices that are not well justified. In this work, we first propose a simplified version of such sinusoidal neural networks, which allows both for easier practical implementation and simpler theoretical analysis. We then analyze the behavior of these networks from the neural tangent kernel perspective and demonstrate that their kernel approximates a low-pass filter with an adjustable bandwidth. Finally, we utilize these insights to inform sinusoidal network initialization, optimizing performance for each of a series of tasks, including learning implicit models and solving differential equations.

1. INTRODUCTION

Sinusoidal networks are neural networks with sine nonlinearities in place of the traditional ReLU or hyperbolic tangent. They have recently been popularized, particularly for applications in implicit representation models, in the form of SIRENs (Sitzmann et al., 2020). However, despite their popularity, many aspects of their behavior and comparative advantages are not yet fully understood. In particular, some initialization and parametrization choices for sinusoidal networks are often made arbitrarily, without a clear understanding of how to optimize these settings in order to maximize performance. In this paper, we first propose a simplified version of such sinusoidal networks that allows for easier implementation and theoretical analysis. We show that these simple sinusoidal networks can match and outperform SIRENs in implicit representation learning tasks, such as fitting videos, images and audio signals. We then analyze sinusoidal networks from a neural tangent kernel (NTK) perspective (Jacot et al., 2018), demonstrating that their NTK approximates a low-pass filter with adjustable bandwidth. We confirm, through an empirical analysis, that this theoretically predicted behavior also holds approximately in practice. We then use the insights from this analysis to inform the choices of initialization and parameters for sinusoidal networks. We demonstrate that we can optimize the performance of a sinusoidal network by tuning the bandwidth of its kernel to the maximum frequency present in the input signal being learned. Finally, we apply these insights in practice, demonstrating that "well tuned" sinusoidal networks outperform other networks in learning implicit representation models with good interpolation outside the training points, and in learning the solution to differential equations.

2. BACKGROUND AND RELATED WORK

Sinusoidal networks. Sinusoidal networks have recently been popularized for implicit modelling tasks by sinusoidal representation networks (SIRENs) (Sitzmann et al., 2020). They have also been evaluated for physics-informed learning, demonstrating promising results in a series of domains (Raissi et al., 2019b; Song et al., 2021; Huang et al., 2021b;a; Wong et al., 2022). Among the benefits of such networks is the fact that the mapping of inputs through an (initially) random linear layer followed by a sine function is mathematically equivalent to a transformation to a random Fourier basis, rendering them close to networks with Fourier feature transforms (Tancik et al., 2020; Rahimi & Recht, 2007), and possibly able to address spectral bias (Basri et al., 2019; Rahaman et al., 2019; Wang et al., 2021). Sinusoidal networks also have the property that the derivative of their outputs is given simply by another sinusoidal network, since the derivative of the sine function is a phase-shifted sine.

Neural tangent kernel. An important precursor to the neural tangent kernel (NTK) is the neural network Gaussian process (NNGP). At random initialization of its parameters θ, the output function of a neural network of depth L with nonlinearity σ converges to a Gaussian process, called the NNGP, as the widths of its layers n_1, ..., n_L → ∞ (Neal, 1994; Lee et al., 2018). This result, though interesting, does not say much on its own about the behavior of trained neural networks. That role is left to the NTK, which is defined as the kernel given by Θ(x, x̃) = ⟨∇_θ f_θ(x), ∇_θ f_θ(x̃)⟩. It can be shown that this kernel can be written as a recursive expression involving the NNGP. Importantly, Jacot et al. (2018) demonstrated that, again as the network layer widths n_1, ..., n_L → ∞, the NTK is (1) deterministic at initialization and (2) constant throughout training.
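At finite width, the NTK can be evaluated directly from its definition as an inner product of parameter gradients. The sketch below (our own illustration, not from any paper codebase; all names are ours) computes Θ(x, x̃) = ⟨∇_θ f_θ(x), ∇_θ f_θ(x̃)⟩ analytically for a one-hidden-layer sine network with scalar input:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                      # hidden width (the NTK results hold as n -> infinity)
w = rng.normal(0.0, 1.0, n)  # first-layer weights
b = rng.normal(0.0, 1.0, n)  # first-layer biases
v = rng.normal(0.0, 1.0, n)  # output-layer weights

def param_grad(x):
    """Gradient of f(x) = v . sin(w x + b) / sqrt(n) w.r.t. all parameters (w, b, v)."""
    pre = w * x + b
    s, c = np.sin(pre), np.cos(pre)
    # d f / d w_i = v_i cos(.) x,  d f / d b_i = v_i cos(.),  d f / d v_i = sin(.)
    return np.concatenate([v * c * x, v * c, s]) / np.sqrt(n)

def empirical_ntk(x1, x2):
    """Empirical NTK: inner product of the two parameter-gradient vectors."""
    return param_grad(x1) @ param_grad(x2)

print(empirical_ntk(0.3, 0.7), empirical_ntk(0.7, 0.3))  # symmetric by construction
```

By symmetry of the inner product, the kernel is symmetric, and its diagonal entries are squared gradient norms and hence nonnegative.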
Finally, it has also been demonstrated that, under some assumptions on its parametrization, the output function of the trained neural network f_θ converges to the kernel regression solution using the NTK (Lee et al., 2020; Arora et al., 2019). In other words, under certain assumptions the behavior of a trained deep neural network can be modeled as kernel regression using the NTK.

Physics-informed neural networks. Physics-informed neural networks (Raissi et al., 2019a) are a method for approximating the solution to differential equations using neural networks (NNs). In this method, a neural network û(t, x; θ), with learned parameters θ, is trained to approximate the actual solution function u(t, x) of a given partial differential equation (PDE). Importantly, PINNs employ not only a standard "supervised" data loss, but also a physics-informed loss, which consists of the differential equation residual N. The training loss is thus a linear combination of two terms, one directly supervised from data and one informed by the underlying differential equation.
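As a concrete illustration of this combined loss, the toy sketch below (our own hypothetical example, not the setup of any cited paper) uses a polynomial stand-in for the network û and a central finite difference in place of automatic differentiation, for the toy ODE du/dt = cos(t):

```python
import numpy as np

def u_hat(t, theta):
    """Toy stand-in for the network: a polynomial in t with coefficients theta."""
    return sum(c * t**k for k, c in enumerate(theta))

def pinn_loss(theta, t_data, u_data, t_colloc, lam=1.0, eps=1e-4):
    """Data loss + physics loss for the toy ODE du/dt = cos(t).

    The residual N[u] = du/dt - cos(t) is evaluated at collocation points,
    with a central finite difference standing in for autodiff.
    """
    data = np.mean((u_hat(t_data, theta) - u_data) ** 2)
    du = (u_hat(t_colloc + eps, theta) - u_hat(t_colloc - eps, theta)) / (2 * eps)
    physics = np.mean((du - np.cos(t_colloc)) ** 2)
    return data + lam * physics

# Example: theta = [0, 1] encodes u_hat(t) = t, which fits the data point
# (t, u) = (0, 0) exactly but leaves a nonzero physics residual 1 - cos(t).
loss = pinn_loss([0.0, 1.0], np.array([0.0]), np.array([0.0]),
                 np.linspace(0.0, 1.0, 10))
```

In an actual PINN, û would be a neural network, the derivative would come from automatic differentiation, and θ would be optimized by gradient descent on this combined loss.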

3. SIMPLE SINUSOIDAL NETWORKS

There are many details that complicate the practical implementation of current sinusoidal networks. We aim to propose a simplified version of such networks, in order to facilitate both theoretical analysis and practical implementation, by removing these complications. As an example, we can look at SIRENs, which have their layer activations defined as f_l(x) = sin(ω(W_l x + b_l)). In order to cancel the ω factor, layers after the first have their weights initialized from a uniform distribution with range [−√(6/n)/ω, √(6/n)/ω], where n is the size of the layer. Unlike the other layers, the first layer is sampled from a uniform distribution with range [−1/n, 1/n].

We instead propose a simple sinusoidal network, with the goal of formulating an architecture that mainly amounts to substituting its activation functions by the sine function. We will, however, keep the ω parameter, since (as we will see in later analyses) it is in fact a useful tool for allowing the network to fit inputs of diverse frequencies. The layer activations of our simple sinusoidal network, with parameter ω, are defined as

f_1(x) = sin(ω(W_1 x + b_1)),    f_l(x) = sin(W_l x + b_l),  l > 1.    (1)

Finally, instead of utilizing a uniform initialization as in SIRENs (with different bounds for the first and subsequent layers), we propose initializing all parameters in our simple sinusoidal network using the default Kaiming (He) normal initialization scheme. This choice not only greatly simplifies the initialization of the network, but also facilitates theoretical analysis of its behavior under the NTK framework, as we will see in Section 4.

Analysis of the initialization scheme. The initialization scheme proposed above differs from the one implemented in SIRENs. We will now show that this particular choice of initialization distribution preserves the variance of the originally proposed SIREN initialization distribution.
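A minimal NumPy sketch of a forward pass through the layers of Eq. (1) with Kaiming normal initialization follows (our own illustration; the layer sizes, zero bias initialization, and the final linear output layer are assumptions commonly used for implicit representations, not fixed by Eq. (1)):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out):
    """Kaiming (He) normal initialization, used for every layer, first included."""
    W = rng.normal(0.0, np.sqrt(2.0 / fan_in), (fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b

def simple_sinusoidal_net(x, layers, omega=1.0):
    """Forward pass of Eq. (1): sin(omega (W1 x + b1)), then sin(W_l h + b_l).

    `layers` is a list of (W, b) pairs; the last pair acts as a linear
    output layer (an assumption on our part, common in implicit models).
    """
    (W1, b1), *rest = layers
    h = np.sin(omega * (W1 @ x + b1))       # only the first layer carries omega
    for W, b in rest[:-1]:
        h = np.sin(W @ h + b)
    W_out, b_out = rest[-1]
    return W_out @ h + b_out

# Example: a 1 -> 64 -> 64 -> 1 network with omega = 30
sizes = [1, 64, 64, 1]
layers = [init_layer(m, n) for m, n in zip(sizes[:-1], sizes[1:])]
y = simple_sinusoidal_net(np.array([0.5]), layers, omega=30.0)
```

Note that, in contrast to SIREN, no per-layer rescaling of the initialization bounds by ω is needed, since ω multiplies the pre-activation of the first layer only.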
As a consequence, the original theoretical justifications for the SIREN initialization scheme still hold under this activation, namely that the distribution of activations across layers is stable, well behaved and shift invariant. Due to space constraints, proofs are presented in Appendix A. Moreover, we also demonstrate empirically that these properties are maintained in practice.

Lemma 1. Given any c, for X ∼ N(0, c²/3) and Y ∼ U(−c, c), we have Var[X] = Var[Y] = c²/3.
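Lemma 1 can also be checked numerically in a few lines (an illustrative sanity check on our part, not a substitute for the proof in Appendix A):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 2.0
n_samples = 1_000_000

x = rng.normal(0.0, np.sqrt(c**2 / 3.0), n_samples)  # X ~ N(0, c^2 / 3)
y = rng.uniform(-c, c, n_samples)                     # Y ~ U(-c, c)

# Both sample variances should be close to c^2 / 3 (here 4/3).
print(x.var(), y.var(), c**2 / 3)
```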

