RANDOM WEIGHT FACTORIZATION IMPROVES THE TRAINING OF CONTINUOUS NEURAL REPRESENTATIONS

Abstract

Continuous neural representations have recently emerged as a powerful and flexible alternative to classical discretized representations of signals. However, training them to capture fine details in multi-scale signals is difficult and computationally expensive. Here we propose random weight factorization as a simple drop-in replacement for parameterizing and initializing conventional linear layers in coordinate-based multi-layer perceptrons (MLPs) that significantly accelerates and improves their training. We show how this factorization alters the underlying loss landscape and effectively enables each neuron in the network to learn using its own self-adaptive learning rate. This not only helps with mitigating spectral bias, but also allows networks to quickly recover from poor initializations and reach better local minima. We demonstrate how random weight factorization can be leveraged to improve the training of neural representations on a variety of tasks, including image regression, shape representation, computed tomography, inverse rendering, solving partial differential equations, and learning operators between function spaces.

1. INTRODUCTION

Some of the recent advances in machine learning can be attributed to new developments in the design of continuous neural representations, which employ coordinate-based multi-layer perceptrons (MLPs) to parameterize discrete signals (e.g. images, videos, point clouds) across space and time. Such parameterizations are appealing because they are differentiable and much more memory efficient than grid-sampled representations, naturally allowing smooth interpolations to unseen input coordinates. As such, they have achieved widespread success in a variety of computer vision and graphics tasks, including image representation (Stanley, 2007; Nguyen et al., 2015), shape representation (Chen & Zhang, 2019; Park et al., 2019; Genova et al., 2019; 2020), view synthesis (Sitzmann et al., 2019; Saito et al., 2019; Mildenhall et al., 2020; Niemeyer et al., 2020), texture generation (Oechsle et al., 2019; Henzler et al., 2020), etc. Coordinate-based MLPs have also been applied to scientific computing applications such as physics-informed neural networks (PINNs) for solving forward and inverse partial differential equations (PDEs) (Raissi et al., 2019; 2020; Karniadakis et al., 2021), and Deep Operator Networks (DeepONets) for learning operators between infinite-dimensional function spaces (Lu et al., 2021; Wang et al., 2021e).

Despite their flexibility, it has been shown both empirically and theoretically that coordinate-based MLPs suffer from "spectral bias" (Rahaman et al., 2019; Cao et al., 2019; Xu et al., 2019). This manifests as a difficulty in learning the high frequency components and fine details of a target function. A popular method to resolve this issue is to embed input coordinates into a higher dimensional space, for example by using Fourier features before the MLP (Mildenhall et al., 2020; Tancik et al., 2020). Another widely used approach is the use of SIREN networks (Sitzmann et al., 2020), which employ MLPs with periodic activations to represent complex natural signals and their derivatives. One main limitation of these methods is that a number of associated hyper-parameters (e.g.
scale factors) need to be carefully tuned in order to avoid catastrophic generalization/interpolation errors. Unfortunately, the selection of appropriate hyper-parameters typically requires some prior knowledge about the target signals, which may not be available in some applications. More general approaches to improve the training and performance of MLPs involve different types of normalizations, such as Batch Normalization (Ioffe & Szegedy, 2015), Layer Normalization (Ba et al., 2016) and Weight Normalization (Salimans & Kingma, 2016). However, despite their remarkable success in deep learning benchmarks, these techniques are not widely used in MLP-based neural representations. Here we draw motivation from the work of Salimans & Kingma (2016) and Wang et al. (2021a) and investigate a simple yet remarkably effective re-parameterization of weight vectors in MLP networks, coined random weight factorization, which generalizes Weight Normalization and demonstrates significant performance gains. Our main contributions are summarized as follows:

• We show that random weight factorization alters the loss landscape of a neural representation in a way that can drastically reduce the distance between different parameter configurations, and effectively assigns a self-adaptive learning rate to each neuron in the network.

• We empirically illustrate that random weight factorization can effectively mitigate spectral bias, as well as enable coordinate-based MLP networks to escape from poor initializations and find better local minima.

• We demonstrate that random weight factorization can be used as a simple drop-in enhancement to conventional linear layers, and yields consistent and robust improvements across a wide range of tasks in computer vision, graphics and scientific computing.

2. WEIGHT FACTORIZATION

Let $x \in \mathbb{R}^d$ be the input, $g^{(0)}(x) = x$ and $d_0 = d$. We consider a standard multi-layer perceptron (MLP) $f_{\theta}(x)$ recursively defined by

$$f^{(l)}_{\theta}(x) = W^{(l)} g^{(l-1)}(x) + b^{(l)}, \quad g^{(l)}(x) = \sigma\big(f^{(l)}_{\theta}(x)\big), \quad l = 1, 2, \ldots, L,$$

with a final layer

$$f_{\theta}(x) = W^{(L+1)} g^{(L)}(x) + b^{(L+1)},$$

where $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the weight matrix of the $l$-th layer and $\sigma$ is an element-wise activation function. Here, $\theta = \{W^{(1)}, b^{(1)}, \ldots, W^{(L+1)}, b^{(L+1)}\}$ represents all trainable parameters in the network. MLPs are commonly trained by minimizing an appropriate loss function $\mathcal{L}(\theta)$ via gradient descent. To improve convergence, we propose to factorize the weight parameters associated with each neuron in the network as

$$\mathbf{w}^{(k,l)} = s^{(k,l)} \cdot \mathbf{v}^{(k,l)}, \quad k = 1, 2, \ldots, d_l, \quad l = 1, 2, \ldots, L+1,$$

where $\mathbf{w}^{(k,l)} \in \mathbb{R}^{d_{l-1}}$ is a weight vector representing the $k$-th row of the weight matrix $W^{(l)}$, $s^{(k,l)} \in \mathbb{R}$ is a trainable scale factor assigned to each individual neuron, and $\mathbf{v}^{(k,l)} \in \mathbb{R}^{d_{l-1}}$. Consequently, the proposed weight factorization can be written in matrix form as

$$W^{(l)} = \operatorname{diag}(s^{(l)}) \cdot V^{(l)}, \quad l = 1, 2, \ldots, L+1, \tag{2.4}$$

with $s^{(l)} \in \mathbb{R}^{d_l}$.
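The factorization above can be sketched in a few lines of NumPy. The layer sizes, the Glorot-style initialization of $W$, and the log-normal sampling of the per-neuron scales are illustrative assumptions, not the paper's prescribed scheme; the point is only that $\operatorname{diag}(s) V$ reproduces $W$ exactly, so the forward pass is unchanged while the trainable parameters become $(s, V, b)$:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 3, 4  # illustrative layer sizes

# Conventional dense layer: W in R^{d_out x d_in}, bias b.
W = rng.normal(scale=np.sqrt(2.0 / (d_in + d_out)), size=(d_out, d_in))
b = np.zeros(d_out)

# Factorize each row w^(k) as s^(k) * v^(k): sample one positive scale
# per output neuron (an assumption for this sketch) and set
# V = diag(s)^{-1} W, so that diag(s) @ V == W exactly.
s = np.exp(rng.normal(size=d_out))  # per-neuron trainable scale factors
V = W / s[:, None]                  # trainable direction matrix

def dense(x, W, b):
    # Standard linear layer: W x + b.
    return W @ x + b

def factorized_dense(x, s, V, b):
    # Factorized layer: (diag(s) V) x + b; same output as dense().
    return (s[:, None] * V) @ x + b

x = rng.normal(size=d_in)
assert np.allclose(dense(x, W, b), factorized_dense(x, s, V, b))
assert np.allclose(s[:, None] * V, W)
```

Since the reconstruction is exact, the factorization only changes how gradients flow to the parameters during training, not what functions the network can represent.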

2.1. A GEOMETRIC PERSPECTIVE

In this section, we provide a geometric motivation for the proposed weight factorization. To this end, we consider the simplest setting of a one-parameter loss function $\ell(w)$. In this case, the weight factorization reduces to $w = s \cdot v$ for two scalars $s$ and $v$. Note that for a given $w \neq 0$ there are infinitely many pairs $(s, v)$ such that $w = s \cdot v$. The set of such pairs forms a family of hyperbolas in the $sv$-plane (one branch for each admissible choice of signs of $s$ and $v$). As such, the loss function in the $sv$-plane is constant along these hyperbolas.




