DISCRETIZATION INVARIANT LEARNING ON NEURAL FIELDS

Abstract

While neural fields have emerged as powerful representations of continuous data, there is a need for neural networks that can perform inference on such data without being sensitive to how the field is sampled, a property called discretization invariance. We develop DI-Net, a framework for learning discretization invariant operators on neural fields of any type. Whereas current theoretical analyses of discretization invariant networks are restricted to the limit of infinite samples, our analysis does not require infinite samples and establishes upper bounds on the variation in DI-Net outputs given different finite discretizations. Our framework leads to a family of neural networks driven by numerical integration via quasi-Monte Carlo sampling with discretizations of low discrepancy. DI-Nets manifest desirable theoretical properties such as universal approximation of a large class of maps between L² functions, and gradients that are also discretization invariant. DI-Nets can also be seen as generalizations of many existing network families as they bridge discrete and continuous network classes, such as convolutional neural networks (CNNs) and neural operators respectively. Experimentally, DI-Nets derived from CNNs are demonstrated to classify and segment visual data represented by neural fields under various discretizations, and sometimes even generalize to new types of discretizations at test time. Code: supplementary materials (URL to be released).

Under review as a conference paper at ICLR 2023.
Continuous convolutions At the core of many discretization invariant approaches is the continuous convolution, which also provides permutation invariance, translation invariance and locality. Its applications include modeling point clouds (Wang et al., 2021; Boulch, 2019), graphs (Fey et al., 2017), fluids (Ummenhofer et al., 2019), and sequential data (Romero et al., 2021b;a), where there is typically no choice of how the data should be discretized. This work focuses on the effect of different discretizations, proposes quasi-Monte Carlo as a canonical method of generating discretizations, and can produce neural fields as output.

Approximation capabilities of neural networks A fundamental result in approximation theory is that the set of single-layer neural networks is dense in a large space of functionals including L^p(R^n) (Hornik, 1991). Subsequent works designed constructive examples using various non-linear activations (Chen et al., 1995; Chen & Chen, 1993). While this result is readily extended to multidimensional outputs, existing approximation results for the case of infinite dimensional outputs (e.g., L^p(R^n) → L^p(R^n)) do not explicitly characterize the contribution of data discretization to the approximation error (Bhattacharya et al., 2020; Lanthaler et al., 2022; Kovachki et al., 2021b;a).

In this section we formalize discretization invariance, define DI-Nets, and derive properties that enable DI-Nets to serve as a general deep learning framework on continuous data. We treat NFs as integrable maps from a domain Ω to R^c. In particular, let Ω be a bounded measurable subset of a d-dimensional compact metric space. The most common case is Ω ⊂ R^d, and in the Appendix we consider the case of a d-dimensional manifold. Denote the space of NFs as F_c = { f_θ : Ω → R^c : ∫_Ω ‖f_θ‖² dµ < ∞ and V(f_θ) < ∞ }, where the variation V(f_θ) measures how much an NF fluctuates over its domain.

1. INTRODUCTION

Neural fields (NFs), which encode signals as the parameters of a neural network, have many useful properties. NFs can efficiently store and stream continuous data (Sitzmann et al., 2020b; Dupont et al., 2022; Gao et al., 2021; Takikawa et al., 2022; Cho et al., 2022), represent and render detailed 3D scenes at lightning speeds (Müller et al., 2022), and integrate data from a wide range of modalities (Gao et al., 2022). NFs are thus an appealing data representation for many applications.

However, current approaches for training networks on a dataset of NFs have major limitations. The sampling-based approach converts such data to pixels or voxels as input to discrete networks (Vora et al., 2021), but it incurs interpolation errors and does not leverage the ability to evaluate the NF anywhere on its domain. The hypernetwork approach trains a model to predict NF parameters (or a lower dimensional "modulation" of such parameters) which can be tailored for downstream tasks (Tancik et al., 2020a; Dupont et al., 2022; Mehta et al., 2021), but hypernetworks based on the parameter space of one type of NF are incompatible with other types. Moreover, hypernetworks are unsuitable for important classes of NFs whose parameters extend beyond a neural network, such as those with voxel (Sun et al., 2021; Alex Yu and Sara Fridovich-Keil et al., 2021), octree (Yu et al., 2021) or hash table (Müller et al., 2022; Takikawa et al., 2022) components. We seek to strengthen the sampling-based approach with the notion of discretization invariance: the output of an operator that processes a continuous signal by sampling it at a set of discrete points should be largely independent of how the sample points are chosen, particularly as the number of points becomes large. In this paper we propose the DI-Net, a discretization invariant neural network for learning and inference on neural fields (Fig. 1).
By parameterizing layers as integrals over parametric functions of the input field, DI-Nets have access to powerful numerical integration techniques that yield strong convergence properties, including a universal approximation theorem for a wide class of maps between function spaces. DI-Nets can be applied to any type of NF, or in fact any data that can be represented as integrable functions on a bounded measurable set. Thus DI-Nets are a broad class of neural networks that encompass other continuous networks such as neural operators, and also extend discrete networks that act on pixels, point clouds, and meshes. They can be applied to classification, segmentation, and many other tasks.

Figure 1: The DI-Net processes a neural field by evaluating it on a point set (discretization) which is used to perform numerical integration throughout the network. DI-Nets are interoperable between all types of NFs and can be trained on a broad range of tasks.

Our contributions are as follows:
• We show that discretization invariance gives rise to a family of neural networks based on numerical integration, which we call DI-Nets.
• Backpropagation through DI-Nets is discretization invariant, and they universally approximate a large class of maps between function spaces.
• DI-Nets generalize a wide class of discrete models to the continuous domain, and we derive continuous analogues of convolutional neural networks for inference on neural fields that encode visual data.
• We demonstrate convolutional DI-Nets on NF classification and dense prediction tasks, and show they can perform well under a range of discretization schemes.
• We probe the limits of discretization invariance in practice, finding that DI-Net has some ability to generalize to new discretizations at test time, modulated by the task and the type of discretizations it was trained on.

2. RELATED WORK

Neural fields Multilayer perceptrons (MLPs) can be trained to capture a wide range of continuous data with high fidelity. The most prominent domains include shapes (Park et al., 2019; Mescheder et al., 2018), objects (Niemeyer et al., 2020; Müller et al., 2022), and 3D scenes (Mildenhall et al., 2020; Sitzmann et al., 2021), but previous works also apply NFs to gigapixel images (Martel et al., 2021), volumetric medical images (Corona-Figueroa et al., 2022), acoustic data (Sitzmann et al., 2020b; Gao et al., 2021), tactile data (Gao et al., 2022), depth and segmentation maps (Kundu et al., 2022), and 3D motion (Niemeyer et al., 2019). Hypernetworks and modulation networks were developed for learning directly with NFs, and have been demonstrated on tasks including generative modeling, data imputation, novel view synthesis and classification (Sitzmann et al., 2020b; 2021; Tancik et al., 2020a; Sitzmann et al., 2019; 2020a; Mehta et al., 2021; Chan et al., 2021; Dupont et al., 2021; 2022). Hypernetworks use meta-learning to learn to produce the MLP weights of desired output NFs, while modulation networks predict modulations that can be used to transform the parameters of an existing NF or generate a new NF. Another approach for learning NF→NF maps evaluates an input NF at grid points, produces features at the same points via a U-Net, and passes interpolated features through an MLP to produce output values at arbitrary query points (Vora et al., 2021).

Discretization invariant networks Networks that are agnostic to the discretization of the data domain have been explored in several contexts. Hilbert space PCA, DeepONets and neural operators learn discretization invariant maps between function spaces (Bhattacharya et al., 2020; Lu et al., 2021; Li et al., 2020; Kovachki et al., 2021b), and are tailored to solve partial differential equations efficiently.
On surface meshes, DiffusionNet (Sharp et al., 2022) uses the diffusion operator to achieve convergent behavior under mesh refinement. These previous works define discretization invariance as convergent behavior in the limit of infinite sample points, but do not characterize how different discretizations yield different behaviors in the finite case. In this work, we choose a stronger definition that bounds the difference between any two finite discretizations, which we show implies the convergence condition. We formulate discretization invariant networks on general metric spaces, which generalizes DeepONets and neural operators, then we focus on the continuous convolution as a core layer for vision applications.

The variation of a 1D function f ∈ C¹([a, b]) is given by V(f) = ∫_a^b |f′(x)| dx; more general definitions are given in Appendix A.1. We call θ the NF's parameters, d its dimensionality, and c its number of channels. For example, an occupancy network (Mescheder et al., 2018) is 3-dimensional and has 1 channel. NeRF (Mildenhall et al., 2020) is 5-dimensional (3 world coordinates and 2 view angles) and has 4 channels. The discretization of an NF is a point set X ⊂ Ω on which the field is evaluated. We say that a map H : F_c → R^n is discretizable if it induces a map Ĥ_X : F_c → R^n that depends only on the input's values at X.

3.2 DISCRETIZATION INVARIANCE VIA NUMERICAL INTEGRATION

We consider two notions of discretization invariance: (1) upper bounding the deviation of the map Ĥ_X from Ĥ_Y for any two discretizations X, Y, and (2) establishing convergence of Ĥ_{X_N} to H under particular sequences {X_N}_{N∈N} of discretizations. We use the first notion to define discretization invariant maps, then characterize sequences of discretizations under which such maps converge.

Definition 1. A discretizable map H : F_c → R^n is discretization invariant if for every discretization X and neural field f_θ, ‖H[f_θ] − Ĥ_X[f_θ]‖ is bounded by a constant (depending on H and f_θ) times the discrepancy of X. A map H : F_c → F_n is discretization invariant if H[·](x) is discretization invariant for all x ∈ Ω.

This definition establishes an upper bound on the deviation between any two discretizations by a simple application of the triangle inequality.
The discrepancy of a discretization is lower for dense, evenly distributed points. For a 1D point set it is given by:

D({x_i}_{i=1}^N) = sup_{a ≤ c ≤ d ≤ b} | |{x_1, ..., x_N} ∩ [c, d]| / N − (d − c)/(b − a) |.

See Appendix A.1 for general definitions of discrepancy. The product of variation and discrepancy is precisely the upper bound in the celebrated Koksma–Hlawka inequality, which bounds the difference between the integral of a function h ∈ L²(Ω) and its sample mean on any point set X ⊂ Ω:

| (1/|X|) Σ_{x′∈X} h(x′) − ∫_Ω h(x) dx | ≤ V(h) D(X).

This naturally leads to a family of discretization invariant (DI) layers which specify a parametric map on neural fields and output its sample mean. Specifically, the action of a DI layer H_ϕ on a neural field f_θ under discretization X is:

Ĥ_X^ϕ : f_θ ↦ (1/|X|) Σ_{x∈X} H_ϕ[f_θ](x).

We propose two forms of H_ϕ[f]:
• Vector-valued DI layers (Ĥ_X^ϕ : F_c → R^n) with H_ϕ[f_θ](x) = h_ϕ(x, f_θ(x)) ∈ R^n. Such layers include global pooling and learned inner products, and could be used as one of the final layers in a classification network.
• NF-valued DI layers (Ĥ_X^ϕ : F_c → F_n) with H_ϕ[f_θ](x) : x′ ↦ h_ϕ(x, x′, f_θ(x), f_θ(x′)) ∈ R^n for all x′ ∈ Ω. Such layers include continuous convolutions and deconvolutions, self-attention, and pooling.

In each case, h_ϕ must be bounded and continuous in all its variables (so that outputs remain of bounded variation), differentiable w.r.t. ϕ (to enable backpropagation), and Gateaux differentiable w.r.t. f_θ (which will make backpropagation discretization invariant, as we discuss in Section 3.4). We consider more general discretization invariant maps in Appendix A.2. Importantly, the functional form of H_ϕ does not depend on the NF parameters θ except through f_θ(x). Thus DI layers are invariant to the parameterization of the NFs that they take as input.
This property allows these layers to be applied to a mixture of NF types, which is not possible with hypernetwork or modulation-based learning approaches. Lastly we note that many common loss functions and regularizers generalize naturally to the continuous domain as bounded continuous maps F c → R (e.g., L2 regularization) or F c × F c → R (e.g., mean squared error), so the properties of DI layers extend to losses on NFs.
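As a numerical illustration of the Koksma–Hlawka bound that motivates DI layers (our own toy example, not from the paper), the sketch below compares the sample-mean integration error of a low discrepancy van der Corput point set against plain Monte Carlo sampling for a 1D function of bounded variation:

```python
import numpy as np

def van_der_corput(n, base=2):
    """First n terms of the van der Corput low discrepancy sequence in [0, 1)."""
    seq = np.empty(n)
    for i in range(n):
        f, denom, k = 0.0, 1.0, i
        while k > 0:
            denom *= base
            k, rem = divmod(k, base)
            f += rem / denom
        seq[i] = f
    return seq

h = lambda x: x ** 2        # variation V(h) = int_0^1 |h'(x)| dx = 1
true_value = 1.0 / 3.0      # int_0^1 x^2 dx
N = 1024

# Sample mean on a low discrepancy set vs. on i.i.d. uniform points.
qmc_err = abs(h(van_der_corput(N)).mean() - true_value)
mc_err = abs(h(np.random.default_rng(0).random(N)).mean() - true_value)
print(f"QMC error: {qmc_err:.2e}, MC error: {mc_err:.2e}")
```

The QMC error is bounded by V(h) D(X), which shrinks like O(log N / N) here, whereas Monte Carlo only achieves O(1/√N) on average.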

3.3. DISCRETIZATION INVARIANT NETWORKS

A DI-Net is a directed acyclic graph of the following types of layers:
• DI layers: numerical integrators mapping an NF to a new NF or vector, as defined above.
• Pointwise layers: maps applied to an NF's value at each point independently, such as activation and normalization layers.

Since pointwise layers preserve an NF's property of bounded variation, DI-Net is discretization invariant. A prototypical DI-Net for classification might consist of NF-valued DI layers separated by normalization and activation layers, and end with a vector-valued DI layer followed by softmax.

Grounding our network architecture in DI layers opens up a rich toolkit of numerical integration methods. We can generate low discrepancy discretizations using quasi-Monte Carlo (QMC), a numerical integration method with favorable convergence rates compared to standard Monte Carlo (Caflisch, 1998). The QMC discretization only requires a single pass through the network, can be deterministic or pseudorandom, and can accelerate computation when the same discretization is used for multiple network layers or all fields in a minibatch. To sample from non-uniform measures, the standard Monte Carlo method with rejection sampling can be used instead. To integrate over a fixed discretization of high discrepancy, we can use a quadrature method that replaces 1/|X| in the sample mean with quadrature weights. Adaptive quadrature techniques can be used to attain specific error bounds at inference time, which can be valuable in applications requiring robustness or verification.
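To make the discretization invariance of such layers concrete, here is a minimal sketch (our own toy construction; the field, weights, and Halton-in-place-of-Sobol choice are all illustrative assumptions). A vector-valued DI layer evaluated under a 2D QMC discretization and under a midpoint grid produces nearly identical outputs:

```python
import numpy as np

def radical_inverse(i, base):
    f, inv = 0.0, 1.0 / base
    while i > 0:
        i, d = divmod(i, base)
        f += d * inv
        inv /= base
    return f

def halton_2d(n):
    """n points of the 2D Halton low discrepancy sequence on [0, 1)^2."""
    return np.array([[radical_inverse(i, 2), radical_inverse(i, 3)]
                     for i in range(1, n + 1)])

def field(x):
    """A smooth 3-channel field on [0, 1]^2 standing in for f_theta."""
    return np.stack([np.sin(2 * x[:, 0]), np.cos(3 * x[:, 1]),
                     x[:, 0] * x[:, 1]], axis=1)

W = np.random.default_rng(0).standard_normal((8, 3)) * 0.1   # layer weights phi

def di_layer(X):
    """Vector-valued DI layer: sample mean of h_phi(x, f(x)) = W f(x) over X."""
    return (field(X) @ W.T).mean(axis=0)

qmc = halton_2d(4096)                                # QMC discretization
t = (np.arange(64) + 0.5) / 64                       # 64 x 64 midpoint grid
grid = np.stack(np.meshgrid(t, t), axis=-1).reshape(-1, 2)

deviation = np.abs(di_layer(qmc) - di_layer(grid)).max()
print(f"max deviation between discretizations: {deviation:.1e}")
```

Both discretizations have low discrepancy, so by the Koksma–Hlawka bound the two outputs deviate by at most V(h) times a small constant.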

3.4. CONVERGENCE UNDER EQUIDISTRIBUTED DISCRETIZATIONS

We call a sequence of discretizations {X_N}_{N∈N} whose discrepancy tends to 0 as N → ∞ an equidistributed discretization sequence. By Equation (3), DI layers converge under such sequences, i.e., lim_{N→∞} Ĥ_{X_N}^ϕ ≡ H_ϕ, and hence forward passes through DI-Nets are also convergent. But demonstrating convergence of DI-Net's discretized gradients is less straightforward. Consider a scalar-valued DI layer on a single-channel NF, H_ϕ : F_1 → R. The derivatives of its output w.r.t. each of its weights ϕ_k can be shown to converge as:

lim_{N→∞} (∂/∂ϕ_k) Ĥ_{X_N}^ϕ ≡ (∂/∂ϕ_k) H_ϕ,

under any equidistributed discretization sequence {X_N}_{N∈N}. Describing the derivative of the layer's output w.r.t. the input NF is more nuanced, since pointwise derivatives are not sufficient to represent backpropagation in the continuous case. We must instead use the Gateaux derivative, which describes the linear change in a map between functions given an infinitesimal change in the input function. We prove the following in Appendix B.1:

Proposition 1. For every f_θ ∈ F_1 and fixed x ∈ Ω, we can design a sequence of bump functions {ψ_x^N}_{N∈N}, each of which is 1 in a small neighborhood around x and vanishes at every point of X_N \ {x}, such that:

lim_{N→∞} (∂/∂f_θ(x)) Ĥ_{X_N}^ϕ[f_θ] = lim_{N→∞} dH_ϕ[f_θ; ψ_x^N],

where dH_ϕ[f_θ; ψ_x^N] is the Gateaux derivative when f_θ is perturbed in the direction of ψ_x^N. Using the chain rule for Gateaux derivatives, we can then show that backpropagation through the entire DI-Net is convergent. These properties can be summarized in the following theorem:

Theorem 1. A DI-Net permits backpropagation of its outputs with respect to its input as well as all its learnable parameters. The gradients converge under any equidistributed discretization sequence.
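The parameter-gradient convergence can be checked numerically on a toy layer (our own example; the layer, field, and equidistributed sequence are illustrative assumptions, not the paper's construction):

```python
import numpy as np

def van_der_corput(n, base=2):
    """Equidistributed sequence: discrepancy -> 0 as n -> infinity."""
    seq = np.empty(n)
    for i in range(n):
        f, denom, k = 0.0, 1.0, i
        while k > 0:
            denom *= base
            k, rem = divmod(k, base)
            f += rem / denom
        seq[i] = f
    return seq

# Toy scalar DI layer on the 1-channel NF f(x) = x over Omega = [0, 1]:
#   H_phi[f] = int_0^1 sin(phi * f(x)) dx, so
#   dH/dphi  = int_0^1 f(x) * cos(phi * f(x)) dx.
phi = 2.0
true_grad = np.sin(2.0) / 2.0 + (np.cos(2.0) - 1.0) / 4.0  # int_0^1 x cos(2x) dx

def discretized_grad(N):
    x = van_der_corput(N)
    return (x * np.cos(phi * x)).mean()   # d/dphi of the sample-mean estimate

errors = {N: abs(discretized_grad(N) - true_grad) for N in (64, 256, 4096)}
print(errors)
```

The discretized gradient error shrinks monotonically along the equidistributed sequence, mirroring the statement of Theorem 1 for a single layer.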

3.5. UNIVERSAL APPROXIMATION THEOREM

Functions of bounded variation are piecewise smooth, hence they can be represented as the integral of some function. Since numerical integration can yield arbitrarily small invariance error, DI-Nets are universal approximators in the following sense:

Theorem 2. For every Lipschitz continuous map R : F_c → F_n, c, n ∈ N, there exists a DI-Net that approximates it to arbitrary accuracy w.r.t. a finite measure ν on F_c.

As a corollary, every Lipschitz continuous map F_c → R^n or R^n → F_c can also be approximated by some DI-Net. A high-level sketch of the F_1 → F_1 case is as follows. Appendix B.2 provides a full proof, including the extension to multi-channel NFs.
1. Fix a discretization X of sufficiently low discrepancy to approximate any function in F_1 to desired accuracy. Let N = |X|.
2. Let π be the projection that maps any function f ∈ F_1 to R^N by selecting its N values along the discretization. Through π, the measure ν on F_1 induces a measure µ on R^N.
3. We can approximate any function in L²(R^N) by covering the volume under the graph of the function with almost disjoint rectangles, and then at inference time summing the heights of the rectangles at the given R^N input.
4. Note that a multilayer perceptron (MLP) can approximate this rectangle cover to arbitrary accuracy with sufficiently steep slopes at their boundaries (Lu et al., 2017).
5. R_x : πf ↦ R[f](x) is in L²(R^N) for each x ∈ X. We want to build a DI-Net that specifies the desired connections on Ω using element-wise products with cutoff functions and linear combinations of channels. The cutoff functions extract the input values along X into separate channels, and the weights of the channels match the weights of the hypothetical MLP from step 4.
6. We repeat this construction N times to specify values at each of the N output points in X, and map all other output points to the value of the closest specified point.
Then we have fully specified the desired behavior of f ↦ R[f] to desired accuracy w.r.t. the measure ν. This construction points to the similarity between specifying a map F_1 → F_1 and specifying maps between two grids of N points. Thus many of the strategies for imposing structure on how different points influence each other can inform DI-Net design. For example, if the influence should be local and translation-invariant, we can design convolutional DI-Nets. If the influence should be sparse, we can design attention-based DI-Nets. If the influence should be distributed globally among small patches, we can design transformer-like DI-Nets with tokenization. If the influence should be modulated by lower frequency patterns, we can use Fourier neural operators (Li et al., 2020).

4. DESIGN AND IMPLEMENTATION OF DI-NETS

DI-Nets encompass a very large family of neural networks: we have only specified the architecture as a directed acyclic graph, and DI layers include a broad variety of network layers. DI-Nets include DeepONets and neural operators (Kovachki et al., 2021b; Lu et al., 2021) , which can learn general maps between function spaces but in practice are designed to solve partial differential equations. DI-Nets also encompass continuous adaptations of networks designed on discrete domains such as convolutional neural networks (CNNs). In the same way that neural fields extend signals on point clouds, meshes, grids, and graphs to a compact metric space, DI-Nets extend networks that operate on discrete signals by converting every layer to an equivalent discretizable map. We make this connection concrete in the case of CNNs below.

4.1. CONVOLUTIONAL DI-NETS

We describe how to extend convolutional layers and multi-scale architectures to DI-Nets here (also see Fig. 2), and discuss other layers in Appendix C, including normalization, max pooling, tokenization and attention. The resulting convolutional DI-Net can be initialized directly with the weights of a pre-trained CNN, as we investigate in Appendix E.1.

Convolution For a measurable S ⊂ Ω and a polynomial basis {p_j}_{j≥0} that spans L²(S), S is the support of a polynomial convolutional kernel K_ϕ : Ω × Ω → R defined by:

K_ϕ(x, x′) = Σ_{j=1}^n ϕ_j p_j(x − x′) if x − x′ ∈ S, and 0 otherwise,

for some chosen n ∈ N. A convolution is the linear map H_ϕ : F_1 → F_1 given by:

H_ϕ[f] = ∫_Ω K_ϕ(·, x′) f(x′) dx′.

An MLP convolution is defined similarly except the kernel becomes K_ϕ(x, x′) = MLP(x − x′; ϕ) in the non-zero case. While MLP kernels are favored over polynomial kernels in many applications due to their expressive power (Wang et al., 2021), polynomial bases can be used to construct filters. The input and output discretizations of the layer can be chosen independently, allowing for padding or striding (see Appendix C). The input discretization fully determines which points on S are evaluated.

Multi-scale architectures Many discretizations permit multi-scale structures by subsampling the discretization, and QMC is particularly conducive to such design. Under QMC, downsampling is efficiently implemented by truncating the list of coordinates in the low discrepancy sequence to the desired number of terms, as the truncated sequence is itself low discrepancy. Similarly, upsampling can be implemented by extending the low discrepancy sequence to the desired number of terms, then performing interpolation by copying the nearest neighbor(s) or applying some (fixed or learned) kernel. Residual or skip connections can also be implemented efficiently since downsampling and upsampling are both specified with respect to the same discretization (Fig. 2 right).
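A minimal numpy sketch of an MLP-kernel continuous convolution on a point set follows. The kernel architecture, support radius, channel counts, and the decoupled input/output discretizations are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MLP kernel K_phi(x - x'): R^2 -> (c_in x c_out) mixing matrix.
c_in, c_out, hidden = 3, 4, 16
W1 = rng.standard_normal((hidden, 2)) * 0.5
b1 = np.zeros(hidden)
W2 = rng.standard_normal((c_in * c_out, hidden)) * 0.5

def kernel(delta):
    h = np.tanh(delta @ W1.T + b1)
    return (h @ W2.T).reshape(-1, c_in, c_out)

def continuous_conv(X_in, f_vals, X_out, radius=0.25):
    """Sample-mean estimate of int K_phi(x - x') f(x') dx' at each x in X_out."""
    out = np.zeros((len(X_out), c_out))
    for i, x in enumerate(X_out):
        delta = x - X_in
        mask = np.linalg.norm(delta, axis=1) <= radius   # kernel support S
        if mask.any():
            K = kernel(delta[mask])                      # (m, c_in, c_out)
            out[i] = np.einsum('mio,mi->o', K, f_vals[mask]) / len(X_in)
    return out

X_in = rng.random((512, 2))            # stands in for a QMC input discretization
f_vals = np.sin(X_in @ rng.standard_normal((2, c_in)))
X_out = rng.random((10, 2))            # queries need not match the input points
out = continuous_conv(X_in, f_vals, X_out)
```

Because the output discretization is independent of the input one, the same layer realizes padding, striding, or evaluation at entirely new query points.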

4.2. TRAINING DI-NETS

The pipeline for training DI-Nets is similar to that for training discrete networks, except that input and/or output discretizations should be specified. When training a DI-Net classifier, the input discretization may be specified manually or sampled from a low discrepancy sequence to perform QMC integration. When training DI-Nets for dense prediction, the output discretization should be chosen to match the coordinates of the ground truth labels. Any input discretization can be chosen; in most experiments we set it equal to the output discretization. At inference time, the network can be evaluated with any output discretization (Fig. D.2), making the output in effect an NF. We outline steps for training a classifier and dense prediction DI-Net in Algorithms 2 and 3.
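The referenced algorithms are in the appendix; as a stand-in, here is a hedged toy training loop in the same spirit. A vector-valued DI layer is trained by SGD on a regression task whose label is an integral of the field, drawing a fresh Monte Carlo discretization each step. The fields, features, and hyperparameters are entirely our own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: each "NF" is f_a(x) = sin(a x) on [0, 1];
# the label y = 2 * int_0^1 f_a is known in closed form.
amps = rng.uniform(1.0, 3.0, size=64)
label = lambda a: 2.0 * (1.0 - np.cos(a)) / a

w = np.zeros(2)   # DI layer h_w(x, f(x)) = w0 * f(x) + w1 * x * f(x)
lr = 0.5
for step in range(2000):
    a = amps[step % len(amps)]
    X = rng.random(256)                        # fresh MC discretization per step
    feats = np.stack([np.sin(a * X), X * np.sin(a * X)]).mean(axis=1)
    pred = feats @ w                           # discretized DI layer output
    w -= lr * 2.0 * (pred - label(a)) * feats  # SGD on squared error

# Evaluate with a different, fixed midpoint-grid discretization.
Xe = (np.arange(1024) + 0.5) / 1024
def predict(a):
    fe = np.stack([np.sin(a * Xe), Xe * np.sin(a * Xe)]).mean(axis=1)
    return fe @ w
mse = np.mean([(predict(a) - label(a)) ** 2 for a in amps])
print(f"final mse: {mse:.4f}")
```

Note that training and evaluation use different discretizations of the same fields; the learned layer transfers because its output is an integral estimate rather than a function of fixed sample locations.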

5. EXPERIMENTS

We apply convolutional DI-Nets to toy classification (NF→scalar) and dense prediction (NF→NF) tasks, and analyze its behavior under different discretizations. Our aim is not to compete with discrete networks on these tasks, but rather to demonstrate that simply deriving DI-Nets from CNNs yields a feasible class of models for discretization invariant learning on NFs, without introducing any new techniques or types of layers. Appendix D provides further experimental details. We discuss techniques that could be leveraged to design more competitive DI-Nets in Appendix F.

5.1. NEURAL FIELD CLASSIFICATION

We perform classification on a dataset of 8,400 NFs fit to a subset of ImageNet1k (Deng et al., 2009), with 700 samples from each of 12 superclasses (Engstrom et al., 2019). For each class we train on 500 SIRENs (Sitzmann et al., 2020b) and evaluate on 200 Gaussian Fourier feature networks (Tancik et al., 2020b). We train DI-Nets with 2 and 4 MLP convolutional layers, as well as CNNs with similar architectures. We also train an MLP that predicts class labels from SIREN parameters, and a "non-uniform convolution" (Jiang et al., 2019) that applies a non-uniform Fourier transform to input points (sampled with QMC) to map them to grid values, then applies a 2-layer CNN.

Figure 3: Classifier performance with different resolutions at test time.

Each network is trained for 8K iterations with a learning rate of 10^-3. In training, the CNNs sample neural fields along the 32 × 32 grid. DI-Nets and the non-uniform network sample 1024 points generated from a scrambled Sobol sequence (QMC discretization). We evaluate models with top-1 accuracy at the same resolution as well as at several other resolutions. The MLP and the non-uniform method significantly underperform DI-Net, with 13.9% and 28.9% accuracy respectively compared to 32.9% for our 2-layer network. At 32 × 32 resolution, DI-Nets somewhat underperform their CNN counterparts, and this performance gap is larger for deeper models. However, our discretization invariant model better generalizes to images of different resolutions than CNNs (Fig. 3), particularly at lower resolutions.

We next examine whether DI-Net can adapt to an entirely different type of discretization at test time. We use grid, QMC and shrunk discretizations of 1024 (32 × 32) points. The shrunk discretization shrinks a Sobol (QMC) sequence towards the center of the image: each point x ∈ [−1, 1]² is mapped to x² sgn(x) element-wise.
In image classification, the object of interest is usually centered, hence the shrunk→shrunk setting performs on par with other discretizations despite its higher discrepancy. Interestingly, changing discretization type at inference time has varying impact. Usually it only slightly degrades DI-Net's accuracy (Table 1), but performance falls dramatically when shifting from high to low discrepancy discretizations (shrunk→QMC). Thus discretization invariance only provides a weak guarantee on the stability of a model's behavior; this points to the importance of training on the right discretizations to attain a network that generalizes well to other discretizations for the given task.
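The three discretization types can be sketched as follows (our own illustration; Halton stands in for the paper's scrambled Sobol sequence):

```python
import numpy as np

def radical_inverse(i, base):
    f, inv = 0.0, 1.0 / base
    while i > 0:
        i, d = divmod(i, base)
        f += d * inv
        inv /= base
    return f

def halton_2d(n):
    """2D low discrepancy points on [0, 1)^2 (stand-in for scrambled Sobol)."""
    return np.array([[radical_inverse(i, 2), radical_inverse(i, 3)]
                     for i in range(1, n + 1)])

n = 1024
t = ((np.arange(32) + 0.5) / 32) * 2 - 1            # 32 x 32 grid on [-1, 1]^2
grid = np.stack(np.meshgrid(t, t), axis=-1).reshape(-1, 2)
qmc = 2 * halton_2d(n) - 1                          # QMC points on [-1, 1]^2
shrunk = np.sign(qmc) * qmc ** 2                    # x -> x^2 sgn(x), per axis

# Shrinking pulls every coordinate toward 0, concentrating samples centrally
# at the cost of higher discrepancy near the image borders.
print(np.abs(qmc).mean(), np.abs(shrunk).mean())
```

Since |x| ≤ 1 on the domain, x² sgn(x) moves every coordinate strictly toward the center, which is why the shrunk set trades discrepancy for density where centered objects lie.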

5.2. NEURAL FIELD SEGMENTATION

We perform semantic segmentation of SIRENs fit to street view images from Cityscapes (Cordts et al., 2016), grouping segmentation labels into 7 categories. We train on 2975 NFs with coarsely annotated segmentations only, and test on 500 NFs with both coarse and fine annotations (Fig. 4). We use a 48 × 96 grid discretization since segmentation labels are only given at pixel coordinates. We compare the performance of 3 and 5 layer DI-Nets and fully convolutional networks (FCNs), as well as a non-uniform CNN (Jiang et al., 2019). We also train a hypernetwork that learns to map each SIREN to the parameters of a new SIREN representing its segmentation. Networks are trained for 10K iterations with a learning rate of 10^-3. We evaluate each model with mean intersection over union (mIoU) and pixel-wise accuracy (PixAcc). The hypernetwork and non-uniform CNN perform poorly compared to both FCNs and DI-Nets (Table 2). DI-Net-3 outperforms the equivalent FCN, and less often confuses features such as shadows and road markings (Fig. 4). However, the performance deteriorates when downsampling and upsampling layers are added (DI-Net-5), echoing the difficulty in scaling DI-Nets observed in classification. We suggest potential methods for remedying this in Appendix F.

5.3. SIGNED DISTANCE FUNCTION PREDICTION

We train a convolutional DI-Net to map a field of RGBA values in 3D to its signed distance function (SDF). We create a synthetic dataset of 3D scenes with randomly colored balls embedded in 3D space, and train on RGBA-SDF pairs using a mean squared error (MSE) loss on the predicted SDF. We train with grid (16 × 16 × 16), QMC, shrunk, Monte Carlo (i.i.d. points drawn uniformly from the domain), and mixed discretizations of 4096 points. In the mixed setting, each minibatch uses one of the other four discretizations at random. Under any fixed discretization, the convolutional DI-Net significantly outperforms the equivalent discrete network (MSE of 0.022 vs. 0.067 respectively). In Figure D.2, we illustrate our model's ability to also produce outputs that are discretized differently than the input, making DI-Net's output in effect a neural field. By changing the output discretization of the last convolutional DI-Net layer, we can evaluate the output SDF anywhere on the domain without changing the input discretization. Whereas the discrete network is forced to output predictions at the resolution it was trained on, convolutional DI-Net can produce outputs along a high-quality grid discretization given a coarse QMC discretization, even when it is only trained under QMC output discretizations.

6. CONCLUSION

DI-Net constitutes the first discretization invariant sampling approach for performing inference directly on neural fields. Motivated by discretization invariance, we designed a parameterization based on numerical integration that gives rise to strong convergence properties and a universal approximation theorem for a wide class of functions. Not only is our framework agnostic to NF parameterization, but it also extends to functions of bounded variation over a wide class of domains, and thus can be applied to many systems that process a continuous signal by querying it on a subset of its domain. We outline several directions for enhancing such models in Appendix F, which could enable them to scale to deeper architectures and tackle more challenging tasks. With the increasing popularity and diversity of neural fields as well as the emergence of tools to efficiently create large datasets of NFs, DI-Nets may become an attractive option when interoperability and discretization invariance are desired.

A.1 GENERAL DEFINITIONS OF VARIATION AND DISCREPANCY

Generalizations of the Koksma–Hlawka inequality take the form

| (1/|X|) Σ_{x′∈X} f(x′) − ∫_Ω f(x) dx | ≤ V(f) D(X),

for normalized measure dx, some notion of variation V of the function and some notion of discrepancy D of the point set. The classical inequality gives a tight error bound for functions of bounded variation in the sense of Hardy–Krause (BVHK), a generalization of bounded variation to multivariate functions on [0, 1]^d which has bounded variation in each variable. Specifically, the variation is defined as:

V_HK(f) = Σ_{α∈{0,1}^d} ∫_{[0,1]^{|α|}} |(∂^α/∂x_α) f(x_α)| dx,

with {0, 1}^d the multi-indices and x_α ∈ [0, 1]^d such that x_{α,j} = x_j if j ∈ α and x_{α,j} = 1 otherwise. The classical inequality also uses the star discrepancy of the point set X, given by:

D*(X) = sup_{I∈J} | (1/|X|) Σ_{x′∈X} 1_I(x′) − λ(I) |,

where J is the set of d-dimensional intervals that include the origin, and λ the Lebesgue measure. A point set is called low discrepancy if its discrepancy is on the order of O((ln N)^d / N).
Quasi-Monte Carlo calculates the sample mean using a low discrepancy sequence (see Fig.). Then f|_Ω is a piecewise smooth function with the Koksma–Hlawka inequality given by variation

V(f) = Σ_{α∈{0,1}^d} 2^{d−|α|} ∫_{[0,1]^d} |(∂^α/∂x_α) f(x)| dx,

and discrepancy:

D(X) = 2^d sup_{I⊆[0,1]^d} | (1/|X|) Σ_{x′∈X} 1_{Ω∩I}(x′) − λ(Ω ∩ I) |.

W^{d,1} functions on manifolds Let M be a smooth compact d-dimensional manifold with normalized measure dx. Given local charts {ϕ_k}_{k=1}^K, ϕ_k : [0, 1]^d → M, the variation of a function f ∈ W^{d,1}(M) is characterized as:

V(f) = c Σ_{k=1}^K Σ_{|α|≤n} ∫_{[0,1]^d} |(∂^α/∂x^α)(ψ_k(ϕ_k(x)) f(ϕ_k(x)))| dx,

with {ψ_k}_{k=1}^K a smooth partition of unity subordinate to the charts, and constant c > 0 that depends on the charts but not on f. Defining the set of intervals in M as J = {U : U = ϕ_k(I) for some k and I ⊆ [0, 1]^d}, with measure µ(U) = λ(I), the discrepancy of a point set Y = {y_j}_{j=1}^N on M is:

D(Y) = sup_{U∈J} | (1/|Y|) Σ_{y_j∈Y} 1_U(y_j) − µ(U) |.

Neural fields We define the variation of a neural field as the sum of the variations of each channel.
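In 1D, the star discrepancy above admits an exact closed form for sorted points, D*(X) = 1/(2N) + max_i |x_(i) − (2i − 1)/(2N)|, which makes the definitions easy to verify numerically (our own illustration, not from the paper):

```python
import numpy as np

def star_discrepancy_1d(points):
    """Exact star discrepancy of a 1D point set in [0, 1):
    D*(X) = 1/(2N) + max_i |x_(i) - (2i - 1)/(2N)| over sorted x_(1..N)."""
    x = np.sort(np.asarray(points))
    N = len(x)
    i = np.arange(1, N + 1)
    return 1.0 / (2 * N) + np.max(np.abs(x - (2 * i - 1) / (2 * N)))

def van_der_corput(n, base=2):
    seq = np.empty(n)
    for i in range(n):
        f, denom, k = 0.0, 1.0, i
        while k > 0:
            denom *= base
            k, rem = divmod(k, base)
            f += rem / denom
        seq[i] = f
    return seq

N = 256
d_mid = star_discrepancy_1d((np.arange(N) + 0.5) / N)   # optimal: exactly 1/(2N)
d_vdc = star_discrepancy_1d(van_der_corput(N))          # O(log N / N)
d_rng = star_discrepancy_1d(np.random.default_rng(0).random(N))  # ~O(1/sqrt(N))
print(d_mid, d_vdc, d_rng)
```

The midpoint grid attains the minimal possible 1D star discrepancy 1/(2N), the van der Corput set is close behind, and i.i.d. uniform points are markedly worse, matching the O((ln N)^d / N) vs. O(1/√N) comparison above.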

Note: The notion of discrepancy is not limited to the Lebesgue measure. The existence of low discrepancy point sets for non-negative, normalized Borel measures on $[0,1]^d$ was proven by Aistleitner & Dick (2013). An extension of our framework to non-uniform measures is a promising direction for future work (see Appendix F).

A.2 MORE GENERAL FORMS OF DI LAYERS

Recall that we defined DI layers as having the form $H_\phi[f] = \int_\Omega \mathcal{H}_\phi[f](x)\,dx$ for neural fields $f$ (we drop $\theta$ here for readability). In the case where $H_\phi[f] : \Omega \to \mathbb{R}^n$, i.e., $H_\phi$ is a layer that maps neural fields to vectors, we permit layers of the following more general form:
$$H_\phi[f] = \int_\Omega h_\phi\big(x, f(x), \ldots, D^\alpha f(x)\big)\,d\mu(x),$$
for weak derivatives up to order $|\alpha|$ taken with respect to each channel, where $D^\alpha f = \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1} \cdots \partial x_n^{\alpha_n}}$ for multi-index $\alpha$. The dependence of $h_\phi$ on weak derivatives up to order $k = |\alpha|$ requires that the weak derivatives are integrable, i.e., $f$ lies in the Sobolev space $W^{k,2}(\Omega)$, and that $h_\phi$ is Gateaux differentiable w.r.t. these weak derivatives. Note that a non-uniform measure $\mu$ changes the discrepancy of sampled sequences. In the NF-valued case, we can similarly have:
$$H_\phi[f](x') = \int_\Omega \mathcal{H}_\phi[f](x, x')\,d\mu(x) \qquad (17)$$
$$= \int_\Omega h_\phi\big(x, f(x), \ldots, D^\alpha f(x),\, x', f(x'), \ldots, D^\alpha f(x')\big)\,d\mu(x),$$
where we require $\mathcal{H}_\phi[f](\cdot, x') \in F_n$. Both of our key theoretical results (convergence of discretized gradients and universal approximation) apply to this general form: Gateaux differentiability of $h_\phi$ allows us to apply the same proof as in Appendix B.1 to the derivatives, and since allowing layers to depend on weak derivatives yields an even more expressive class of DI-Nets, the universal approximation theorem still holds.
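The NF-to-vector form above is estimated in practice by a sample mean over the discretization. A minimal runnable sketch, with a toy integrand and layer of our own choosing (names hypothetical):

```python
import numpy as np

# NF-to-vector DI layer sketch: H_phi[f] = integral over Omega of h_phi(x, f(x)),
# estimated by the sample mean over a discretization X.
def di_layer(h_phi, f, X):
    return np.mean([h_phi(x, f(x)) for x in X], axis=0)

# Example on Omega = [0,1]: f(x) = sin(2*pi*x), h_phi(x, y) = phi * y**2.
phi = 3.0
f = lambda x: np.sin(2 * np.pi * x)
h = lambda x, y: phi * y ** 2

X = (np.arange(2048) + 0.5) / 2048   # an equidistributed 1D point set
out = di_layer(h, f, X)              # estimates phi * ∫ sin^2 = phi / 2 = 1.5
```

Any equidistributed point set works here; the Koksma–Hlawka bound controls how fast the estimate approaches the integral.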

B PROOFS B.1 PROOF OF THEOREM 1 (CONVERGENCE OF DISCRETIZED GRADIENTS)

A DI-Net permits backpropagation of its outputs with respect to its input as well as all of its learnable parameters. Under an equidistributed discretization sequence, the gradients of each layer converge to the appropriate derivative under the measure on $\Omega$. We note that this property holds automatically if the layer does not perform numerical integration; this includes layers that take $\mathbb{R}^n$ as input, as well as point-wise transformations. In these cases the (sub)derivatives with respect to inputs and parameters need only be well-defined at each point of the output in order to enable backpropagation.

Choose an equidistributed discretization sequence $\{X_N\}_{N \in \mathbb{N}}$ on $\Omega$. We consider a DI layer $H_\phi$ which takes an NF $f$ (we drop the dependence on $\theta$ for readability) as input and may output a vector or an NF:
$$H_\phi[f] = \int_\Omega \mathcal{H}_\phi[f](x)\,dx.$$
Recall that we write the estimate under $X_N$ as:
$$\hat{H}^N_\phi[f] = \frac{1}{|X_N|} \sum_{x \in X_N} \mathcal{H}_\phi[f](x),$$
and call its derivatives the discretized derivatives of $H_\phi[f]$. We are interested in proving the convergence of the discretized gradients of $H_\phi[f]$ with respect to its input $f$ as well as the parameters $\phi$ of the layer.

Definition 2. For a given discretization $X$, the projection $\pi : f \mapsto \bar{f}$ is the quotient map $L^2(\Omega) \to L^2(\Omega)/\!\sim$ under the equivalence relation $f \sim g$ iff $f(x) = g(x)$ for all $x \in X$. We write $\pi f = \{f(x)\}_{x \in X}$ when the discretization is clear from context.

Lemma 1. For any DI-Net layer $H_\phi$ which takes an NF $f \in F_m$ as input, the discretized gradient of $H_\phi[f]$ w.r.t. its parameters $\phi$ is convergent in $N$:
$$\lim_{N \to \infty} \nabla_\phi \hat{H}^N_\phi[f] < \infty.$$
Additionally, the discretized gradient of $H_\phi[f]$ w.r.t. its discretized input $\pi f$ is convergent in $N$:
$$\lim_{N \to \infty} \nabla_{\pi f} \hat{H}^N_\phi[f] < \infty.$$

Proof. NF to vector: gradients w.r.t. parameters. Consider the case of a layer $H_\phi : F_m \to \mathbb{R}^n$. If $\phi = (\phi_1, \ldots, \phi_K)$, then denote $\phi + \tau e_k = (\phi_1, \ldots, \phi_{k-1}, \phi_k + \tau, \phi_{k+1}, \ldots, \phi_K)$.
$$\lim_{N \to \infty} \frac{\partial}{\partial \phi_k} \hat{H}^N_\phi[f] = \lim_{N \to \infty} \frac{\partial}{\partial \phi_k} \frac{1}{|X_N|} \sum_{x \in X_N} \mathcal{H}_\phi[f](x) \qquad (23)$$
$$= \lim_{N \to \infty} \lim_{\tau \to 0} \frac{1}{\tau |X_N|} \sum_{x \in X_N} \big( \mathcal{H}_{\phi + \tau e_k}[f](x) - \mathcal{H}_\phi[f](x) \big) \qquad (24)$$
$$= \lim_{\tau \to 0} \frac{1}{\tau} \int_\Omega \big( \mathcal{H}_{\phi + \tau e_k}[f](x) - \mathcal{H}_\phi[f](x) \big)\,dx \qquad (25)$$
$$= \lim_{\tau \to 0} \frac{H_{\phi + \tau e_k}[f] - H_\phi[f]}{\tau} \qquad (26)$$
$$= \frac{\partial}{\partial \phi_k} H_\phi[f],$$
where (25) follows by (16) and the Moore–Osgood theorem. Thus the discretized gradient converges to the Jacobian of $H_\phi$ w.r.t. each parameter, which is finite by differentiability and boundedness of $h_\phi$.

NF to NF: gradients w.r.t. parameters. From (17), we stated that a DI-Net layer $H_{\phi'} : L^2(\Omega) \to L^2(\Omega)$ can be expressed as:
$$H_{\phi'}[f](x') = \int_\Omega h\big(x, f(x), \ldots, D^\alpha f(x),\, x', f(x'), \ldots, D^\alpha f(x');\, \phi'\big)\,dx.$$
We can follow the same steps as in the NF-to-vector case above to arrive at:
$$\lim_{N \to \infty} \frac{\partial}{\partial \phi'_k} \frac{1}{|X_N|} \sum_{x \in X_N} \mathcal{H}_{\phi'}[f] = \frac{\partial}{\partial \phi'_k} H_{\phi'}[f],$$
with equality at each point $x' \in \Omega$ and each channel of the NF.

NF input: gradients w.r.t. inputs. Here we combine the NF-to-vector and NF-to-NF cases for brevity. For fixed $x \in \Omega$, the discretized derivative of $H_\phi$ w.r.t. $f(x)$ can be written:
$$\frac{\partial}{\partial f(x)} \hat{H}^N_\phi[f] = \frac{\partial}{\partial f(x)} \frac{1}{|X_N|} \sum_{x'' \in X_N} \mathcal{H}_\phi[f](x'') \qquad (30)$$
$$= \lim_{\tau \to 0} \frac{1}{\tau |X_N|} \sum_{x'' \in X_N} \big( \mathcal{H}_\phi[f + \tau \psi^N_x](x'') - \mathcal{H}_\phi[f](x'') \big),$$
where $\psi^N_x$ is any function in $W^{|\alpha|,1}(\Omega)$ that is 1 at $x$, 0 on $X_N \setminus \{x\}$, and whose derivatives are 0 on $X_N$. As an example, take a bump function which vanishes outside a small neighborhood of $x$ and smoothly ramps to 1 on a smaller neighborhood of $x$, making its weak derivatives 0 at $x$.
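The parameter-gradient convergence above can be sanity-checked numerically on a toy layer of our own design, $H_\phi[f] = \int_0^1 \sigma(\phi f(x))\,dx$, whose discretized gradient w.r.t. $\phi$ is the sample mean of $f(x)\,\sigma'(\phi f(x))$ (this is an illustrative example, not a layer from the paper):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
f = lambda x: x          # a simple non-periodic input NF on Omega = [0,1]
phi = 0.7

def grad_hat(n):
    """Discretized gradient d/dphi of (1/n) sum sigmoid(phi * f(x_i))."""
    x = (np.arange(n) + 0.5) / n
    s = sigmoid(phi * f(x))
    return np.mean(f(x) * s * (1.0 - s))

coarse, fine = grad_hat(64), grad_hat(65536)
# |grad_hat(N) - limit| shrinks as the discretization is refined (Theorem 1).
```

Refining the discretization drives the discretized gradient toward the exact derivative of the integral, mirroring the limit interchange justified by the Moore–Osgood theorem.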

By (3) we know that the sequences $\hat{H}^N_\phi[f] - H_\phi[f]$ and $\hat{H}^N_\phi[f + \tau \psi^N_x] - H_\phi[f + \tau \psi^N_x]$ converge uniformly in $N$ to 0 for any $\tau > 0$, where we use the $\ell^1$ norm for vector outputs or the $L^1$ norm for NF outputs. So for any $\epsilon > 0$ and any $\tau > 0$, we can choose $N_0$ large enough such that for any $N > N_0$:
$$\big\| \hat{H}^N_\phi[f + \tau \psi^N_x] - H_\phi[f + \tau \psi^N_x] \big\| < \frac{\epsilon}{2} \quad\text{and}\quad \big\| \hat{H}^N_\phi[f] - H_\phi[f] \big\| < \frac{\epsilon}{2}.$$
Then
$$\big\| \hat{H}^N_\phi[f + \tau \psi^N_x] - H_\phi[f + \tau \psi^N_x] \big\| + \big\| \hat{H}^N_\phi[f] - H_\phi[f] \big\| < \epsilon,$$
and by the triangle inequality,
$$\big\| \big( \hat{H}^N_\phi[f + \tau \psi^N_x] - H_\phi[f + \tau \psi^N_x] \big) - \big( \hat{H}^N_\phi[f] - H_\phi[f] \big) \big\| < \epsilon \qquad (35)$$
$$\big\| \big( \hat{H}^N_\phi[f + \tau \psi^N_x] - \hat{H}^N_\phi[f] \big) - \big( H_\phi[f + \tau \psi^N_x] - H_\phi[f] \big) \big\| < \epsilon,$$
hence $\big( \hat{H}^N_\phi[f + \tau \psi^N_x] - \hat{H}^N_\phi[f] \big) - \big( H_\phi[f + \tau \psi^N_x] - H_\phi[f] \big)$ converges uniformly to 0. Since the distance between two vectors is 0 iff they are equal, we can write:
$$\lim_{N \to \infty} \big( \hat{H}^N_\phi[f + \tau \psi^N_x] - \hat{H}^N_\phi[f] \big) = \lim_{N \to \infty} \big( H_\phi[f + \tau \psi^N_x] - H_\phi[f] \big) \qquad (37)$$
$$\lim_{\tau \to 0} \frac{1}{\tau} \lim_{N \to \infty} \big( \hat{H}^N_\phi[f + \tau \psi^N_x] - \hat{H}^N_\phi[f] \big) = \lim_{\tau \to 0} \frac{1}{\tau} \lim_{N \to \infty} \big( H_\phi[f + \tau \psi^N_x] - H_\phi[f] \big).$$
By the Moore–Osgood theorem,
$$\lim_{N \to \infty} \lim_{\tau \to 0} \frac{1}{\tau} \big( \hat{H}^N_\phi[f + \tau \psi^N_x] - \hat{H}^N_\phi[f] \big) = \lim_{N \to \infty} \lim_{\tau \to 0} \frac{1}{\tau} \big( H_\phi[f + \tau \psi^N_x] - H_\phi[f] \big) \qquad (39)$$
$$\lim_{N \to \infty} \frac{\partial}{\partial f(x)} \hat{H}^N_\phi[f] = \lim_{N \to \infty} dH_\phi[f; \psi^N_x]. \qquad (40)$$
Since $h_\phi$ is Gateaux differentiable and bounded, $H_\phi$ is also Gateaux differentiable for $f$ of bounded variation, hence the limit on the right-hand side is finite. For each discretization $X_N$, choose a sequence of bump functions around each $x \in X_N$, $\{\psi^N_x\}_{x \in X_N}$; an example of such a family is an (appropriately designed) partition of unity with $|X_N|$ elements. Then the discretized gradient w.r.t. $\pi f$ converges to the limit of the Gateaux derivatives of $H_\phi$ w.r.t. the bump function sequence as $N \to \infty$.

Lemma 2. Chained discretized derivatives converge to the chained Gateaux derivatives.

Proof. Consider a two-layer DI-Net with NF input, $f \mapsto (H_\theta \circ H_\phi)[f]$. For the case of derivatives w.r.t. the input, we would like to show the analogue of (40):
$$\lim_{N \to \infty} \frac{\partial}{\partial f(x)} \big( \hat{H}^N_\theta \circ \hat{H}^N_\phi \big)[f] = \lim_{N \to \infty} d\big( H_\theta \circ H_\phi \big)[f; \psi^N_x], \qquad (41)$$
where the bump function $\psi^N_x$ is defined similarly (1 at $x$ and 0 at each other point of $X_N$).
$$\frac{\partial}{\partial f(x)} \big( \hat{H}^N_\theta \circ \hat{H}^N_\phi \big)[f] = \frac{\partial}{\partial f(x)} \hat{H}^N_\theta\big[ \hat{H}^N_\phi[f] \big] \qquad (42)$$
$$= \lim_{\tau \to 0} \frac{1}{\tau} \Big( \hat{H}^N_\theta\big[ \hat{H}^N_\phi[f + \tau \psi^N_x] \big] - \hat{H}^N_\theta\big[ \hat{H}^N_\phi[f] \big] \Big),$$
as in (31).

By (3) we know $H_\theta\big[ \hat{H}^N_\phi[f + \tau \psi^N_x] \big] - \hat{H}^N_\theta\big[ \hat{H}^N_\phi[f + \tau \psi^N_x] \big]$ converges to 0 in $N$ for all $\tau > 0$ (where we use the $\ell^1$ norm for vector outputs or the $L^1$ norm for NF outputs), as does $\big\| H_\theta\big[ \hat{H}^N_\phi[f] \big] - \hat{H}^N_\theta\big[ \hat{H}^N_\phi[f] \big] \big\|$. Reasoning as in (32)–(39), we have:
$$\lim_{N \to \infty} \lim_{\tau \to 0} \frac{1}{\tau} \Big( \hat{H}^N_\theta\big[ \hat{H}^N_\phi[f + \tau \psi^N_x] \big] - \hat{H}^N_\theta\big[ \hat{H}^N_\phi[f] \big] \Big) \qquad (44)$$
$$= \lim_{\tau \to 0} \frac{1}{\tau} \lim_{N \to \infty} \Big( H_\theta\big[ \hat{H}^N_\phi[f + \tau \psi^N_x] \big] - H_\theta\big[ \hat{H}^N_\phi[f] \big] \Big) \qquad (45)$$
$$= \lim_{N \to \infty} \lim_{\tau \to 0} \frac{1}{\tau} \Big( H_\theta\big[ \hat{H}^N_\phi[f + \tau \psi^N_x] \big] - H_\theta\big[ \hat{H}^N_\phi[f] \big] \Big). \qquad (46)$$
Note that
$$dH_\phi[f; \psi^N_x] = \frac{1}{\tau} \big( H_\phi[f + \tau \psi^N_x] - H_\phi[f] \big) + o(1), \quad\text{i.e.,}\quad H_\phi[f + \tau \psi^N_x] = H_\phi[f] + \tau\, dH_\phi[f; \psi^N_x] + o(\tau). \qquad (48)$$
Then we complete the equality in (41) as follows:
$$\text{LHS} = \lim_{N \to \infty} \frac{\partial}{\partial f(x)} \big( \hat{H}^N_\theta \circ \hat{H}^N_\phi \big)[f] \qquad (49)$$
$$= \lim_{N \to \infty} \lim_{\tau \to 0} \frac{1}{\tau} \Big( H_\theta\big[ \hat{H}^N_\phi[f + \tau \psi^N_x] \big] - H_\theta\big[ \hat{H}^N_\phi[f] \big] \Big) \qquad (50)$$
$$= \lim_{N \to \infty} \lim_{\tau \to 0} \frac{1}{\tau} \Big( H_\theta\big[ H_\phi[f] + \tau\, dH_\phi[f; \psi^N_x] \big] - H_\theta\big[ H_\phi[f] \big] \Big) \qquad (51)$$
$$= \lim_{N \to \infty} dH_\theta\big[ H_\phi[f];\, dH_\phi[f; \psi^N_x] \big] \qquad (52)$$
$$= \lim_{N \to \infty} d\big( H_\theta \circ H_\phi \big)[f; \psi^N_x] \qquad (53)$$
$$= \text{RHS},$$
by the chain rule for Gateaux derivatives. The case of derivatives w.r.t. parameters is straightforward. In the same way we used (32)–(39) to obtain (46), we have:
$$\lim_{N \to \infty} \frac{\partial}{\partial \phi_k} \big( \hat{H}^N_\theta \circ \hat{H}^N_\phi \big)[f] = \lim_{N \to \infty} \lim_{\tau \to 0} \frac{1}{\tau} \Big( \hat{H}^N_\theta\big[ \hat{H}^N_{\phi + \tau e_k}[f] \big] - \hat{H}^N_\theta\big[ \hat{H}^N_\phi[f] \big] \Big) \qquad (55)$$
$$= \lim_{\tau \to 0} \frac{1}{\tau} \big( H_\theta[H_{\phi + \tau e_k}[f]] - H_\theta[H_\phi[f]] \big) \qquad (56)$$
$$= \frac{\partial}{\partial \phi_k} \big( H_\theta \circ H_\phi \big)[f].$$
By induction, the chained derivatives converge for an arbitrary number of layers. Since the properties of DI-Net layers extend to loss functions on DI-Nets, we can treat a loss function similarly to a layer, and write:
$$L_{g'}[g] = \int_\Omega \mathcal{L}[g, g'](x)\,dx \qquad (58)$$
$$\hat{L}^N_{g'}[g] = \frac{1}{|X_N|} \sum_{x \in X_N} \mathcal{L}[g, g'](x) \qquad (59)$$
$$\lim_{N \to \infty} \frac{\partial}{\partial f(x)} \big( \hat{L}^N_{g'} \circ \hat{H}^N_\theta \circ \hat{H}^N_\phi \big)[f] = \lim_{N \to \infty} d\big( L_{g'} \circ H_\theta \circ H_\phi \big)[f; \psi^N_x], \qquad (60)$$
where $g'$ can be some other input to the loss function, such as ground-truth labels. Thus, we can state the following result:

Corollary 1. The gradients of a DI-Net's loss function w.r.t. its inputs and all its parameters are convergent under an equidistributed discretization sequence.

B.2 PROOF OF THEOREM 2 (UNIVERSAL APPROXIMATION THEOREM)

Note: By our definition of $F_c$ (Section 3.1), there exists $V^*$ such that every $f \in F_1$ satisfies a Koksma–Hlawka inequality (3) with $V(|f|) < V^*$. $F_1$ is bounded in the $L^1$ norm since all of its functions are compactly supported and bounded. Consider a Lipschitz continuous map $R : F_1 \to F_1$ such that $d(R[f], R[g])_{L^1} \le M_0\, d(f, g)_{L^1}$ for some constant $M_0$ and all $f, g \in F_1$. Let $M = \max\{M_0, 1\}$. Fix a discretization $X \subset \Omega$ with discrepancy $D(X) = \frac{\epsilon}{12(M+2)V^*}$. By (3) this yields:
$$\left| \frac{1}{|X|} \sum_{x' \in X} f(x') - \int_\Omega f(x)\,dx \right| \le \frac{\epsilon}{12(M+2)}, \qquad (61)$$
for all $f \in F_1$. Let $N$ be the number of points in $X$. Define the equivalence relation $\sim$ and projection $\pi$ as in Definition 2. $L^2(\Omega)/\!\sim$ is isomorphic to $\mathbb{R}^N$, and thus can be given the normalized $\ell^1$ norm:
$$\|\pi f\|_{\ell^1} = \frac{1}{|X|} \sum_{x' \in X} |f(x')|. \qquad (62)$$

Definition 3. Denote the preimage of $\pi$ as $\pi^{-1} : \bar{f} \mapsto \{f' \in F_1 : \pi f' = \bar{f}\}$. Invoking the axiom of choice, define the inverse projection $\pi^{-1} : \pi F_1 \to F_1$ by a choice function over the sets $\pi^{-1}(\pi F_1)$. Note that this inverse projection corresponds to some way of interpolating the $N$ sample points such that the output is in $F_1$. Although our definition implies the existence of such an interpolator, we leave its specification as an open problem. Since $\Omega$ only permits discontinuities along a fixed Borel subset of $[0,1]^d$, these boundaries can be specified a priori in the interpolator. Since all functions in $F_1$ are bounded and continuous outside this set, the interpolator can be represented by a bounded continuous map, hence it is expressible by a DI-Net layer.

Definition 4. $\pi$ generates a $\sigma$-algebra on $F_1$ given by $\mathcal{A} = \{\pi^{-1}(S) : S \in \mathcal{L}\}$, with $\mathcal{L}$ the $\sigma$-algebra of Lebesgue measurable sets on $\mathbb{R}^N$. Because this $\sigma$-algebra depends on $\epsilon$ and the Lipschitz constant of $R$ via the point set's discrepancy, we may write it as $\mathcal{A}_{\epsilon,R}$. In this formulation, we let the tolerance $\epsilon$ and the Lipschitz constant of $R$ dictate which subsets of $F_1$ are measurable, and thus which measures on $F_1$ are permitted.
However, if the desired measure $\nu$ is more fine-grained than what is permitted by $\mathcal{A}_{\epsilon,R}$, then it is $\nu$ that should determine the number of sample points $N$, rather than $\epsilon$ or $R$. We now state the following lemmas, which will be used to prove our universal approximation theorem.

Lemma 3. There is a map $\bar{R} : \pi F_1 \to \pi F_1$ such that
$$\int_\Omega \big| R[f](x) - \pi^{-1} \circ \bar{R} \circ \pi[f](x) \big|\,dx \le \frac{\epsilon}{6}. \qquad (63)$$

Proof. Let $g(x) = |f(x)|$ for $f \in F_1$. Because (61) applies to $g$, we have:
$$\left| \frac{1}{|X|} \sum_{x' \in X} g(x') - \int_\Omega g(x)\,dx \right| \le \frac{\epsilon}{12(M+2)} \qquad (64)$$
$$\big|\, \|\pi f\|_{\ell^1} - \|f\|_{L^1} \big| \le \frac{\epsilon}{12(M+2)}. \qquad (65)$$
Eqn. (65) also implies that for any $\bar{f} \in \pi F_1$:
$$\big|\, \|\bar{f}\|_{\ell^1} - \|\pi^{-1} \bar{f}\|_{L^1} \big| \le \frac{\epsilon}{12(M+2)}. \qquad (66)$$
Combining (65) and (66), we obtain
$$\big|\, \|f\|_{L^1} - \|\pi^{-1} \circ \pi[f]\|_{L^1} \big| \le \frac{\epsilon}{6(M+2)}. \qquad (67)$$
By the triangle inequality and applying $R$:
$$\int_\Omega \big| R[f](x) - \pi^{-1} \circ \pi \circ R[f](x) \big|\,dx \le \frac{\epsilon}{6(M+2)}. \qquad (68)$$
For any $f, g \in F_1$ such that $\pi f = \pi g$, (65) tells us that $d(f, g)_{L^1}$ is at most $\epsilon / 6(M+2)$. Recall that $M$ was defined such that $d(R[f], R[g])_{L^1} \le M\, d(f, g)_{L^1}$ for any such $R$. Then
$$d\big( \pi \circ R[f],\, \pi \circ R[g] \big)_{\ell^1} \le \frac{M\epsilon}{6(M+2)} + \frac{\epsilon}{6(M+2)} = \frac{(M+1)}{(M+2)} \cdot \frac{\epsilon}{6}. \qquad (69\text{–}70)$$
So defining
$$\bar{R} = \arg\min_H\, d\big( H \circ \pi[f],\, \pi \circ R[f] \big)_{\ell^1}, \qquad (71)$$
we have
$$\big\| \bar{R} \circ \pi[f] - \pi \circ R[f] \big\| \le \frac{(M+1)}{(M+2)} \cdot \frac{\epsilon}{6}. \qquad (72)$$
Then by (68),
$$\int_\Omega \big| R[f](x) - \pi^{-1} \circ \bar{R} \circ \pi[f](x) \big|\,dx \le \frac{\epsilon}{6(M+2)} + \frac{(M+1)}{(M+2)} \cdot \frac{\epsilon}{6} = \frac{\epsilon}{6}. \qquad (73\text{–}74)$$

Lemma 4. Consider the extension of $\bar{R}$ to $\mathbb{R}^N \to \mathbb{R}^N$ in which each component of the output has the form:
$$\bar{R}_j(\bar{f}) = \begin{cases} \bar{R}[\pi^{-1} \bar{f}](x_j) & \text{if } \bar{f} \in \pi F_1 \\ 0 & \text{otherwise.} \end{cases} \qquad (75)$$
Then any finite measure $\nu$ on the measurable space $(F_1, \mathcal{A})$ induces a finite measure $\mu$ on $(\mathbb{R}^N, \mathcal{L})$, and $\int_{\mathbb{R}^N} |\bar{R}_j(\bar{f})|\, \mu(d\bar{f}) < \infty$ for each $j$.

Proof. Since the $\sigma$-algebra $\mathcal{A}$ on $F_1$ is generated by $\pi$, the measure $\mu$ with $\mu(\pi S) = \nu(S)$ for all $S \in \mathcal{A}$ is finite and defined w.r.t. the Lebesgue measurable sets on $\pi F_1$. Since $\pi F_1$ can be identified with a measurable subset of $\mathbb{R}^N$, $\mu$ can be naturally extended to $\mathbb{R}^N$. Doing so makes it absolutely continuous w.r.t. the Lebesgue measure on $\mathbb{R}^N$.
To show that $\bar{R}_j(\bar{f})$ is integrable, it suffices to show that it is bounded and compactly supported. $F_1$ is bounded in the $L^1$ norm, so by (65), $\pi F_1$ is bounded in the normalized $\ell^1$ norm. The $\ell^1$ norm on $\mathbb{R}^N$ is strongly equivalent to the uniform norm, so there is some compact set $[-c, c]^N$, $c > 0$, outside of which the extension of $\pi F_1$ to $\mathbb{R}^N$ vanishes; hence $\mathrm{supp}(\bar{R}_j) \subseteq [-c, c]^N$. Similarly, $\pi F_1$ is bounded in the $\ell^1$ norm, hence there exists $c'$ such that $\bar{R}_j < c'$ for all $j$.

Lemma 5. For any finite measure $\mu$ absolutely continuous w.r.t. the Lebesgue measure on $\mathbb{R}^n$, any $J \in L^1(\mu)$ and $\epsilon > 0$, there is a network $K$ such that:
$$\int_{\mathbb{R}^n} |J(\bar{f}) - K(\bar{f})|\, \mu(d\bar{f}) < \frac{\epsilon}{2}. \qquad (76)$$

Proof. The following construction is adapted from Lu et al. (2017). Since $J$ is integrable, there is a cube $E = [-c, c]^n$ such that:
$$\int_{\mathbb{R}^n \setminus E} |J(\bar{f})|\, \mu(d\bar{f}) < \frac{\epsilon}{8}, \qquad (77)$$
$$\|J - 1_E J\|_1 < \frac{\epsilon}{8}. \qquad (78)$$

Case 1: $J$ is non-negative on all of $\mathbb{R}^n$. Define the set under the graph of $J|_E$:
$$G_{E,J} \triangleq \{(\bar{f}, y) : \bar{f} \in E,\ y \in [0, J(\bar{f})]\}. \qquad (79)$$
$G_{E,J}$ is compact in $\mathbb{R}^{n+1}$, hence there is a finite cover of open rectangles $\{R'_i\}$ satisfying $\mu(\cup_i R'_i) - \mu(G_{E,J}) < \frac{\epsilon}{8}$. Take their closures, and extend the sides of all rectangles indefinitely; this results in a set of pairwise almost disjoint rectangles $\{R_i\}$. Keeping only the rectangles $\mathcal{R} = \{R_i : \mu(R_i \cap G_{E,J}) > 0\}$ results in a finite cover satisfying:
$$\sum_{i=1}^{|\mathcal{R}|} \mu(R_i) - \mu(G_{E,J}) < \frac{\epsilon}{8}. \qquad (80)$$
This implies:
$$\sum_{i=1}^{|\mathcal{R}|} \mu(R_i) < \|J\|_1 + \frac{\epsilon}{8}. \qquad (81)$$
For each $R_i = [a_{i1}, b_{i1}] \times \cdots \times [a_{in}, b_{in}] \times [\zeta_i, \zeta_i + y_i]$, let $X_i$ be its first $n$ components (i.e., the projection of $R_i$ onto $\mathbb{R}^n$). Then (80) and the triangle inequality give:
$$\int_E \Big| J(\bar{f}) - \sum_{i=1}^{|\mathcal{R}|} y_i 1_{X_i}(\bar{f}) \Big|\, \mu(d\bar{f}) < \frac{\epsilon}{8}. \qquad (84)$$
Let $Y(\bar{f}) \triangleq \sum_{i=1}^{|\mathcal{R}|} y_i 1_{X_i}(\bar{f})$. By the triangle inequality,
$$\int_{\mathbb{R}^n} |J(\bar{f}) - K(\bar{f})|\, \mu(d\bar{f}) \le \|J - 1_E J\|_1 + \|1_E J - Y\|_1 + \|K - Y\|_1 < \frac{\epsilon}{4} + \|K - Y\|_1, \qquad (85\text{–}86)$$
by (78) and (84).
So it remains to construct $K$ such that $\|K - Y\|_1 < \frac{\epsilon}{4}$. Because $1_{X_i}$ is discontinuous at the boundary of the rectangle $X_i$, it cannot be produced directly by a DI-Net (recall that all layers are continuous maps). However, we can approximate it arbitrarily well with a piecewise linear function that rapidly ramps from 0 to 1 at the boundary. For a fixed rectangle $X_i$ and $\delta \in (0, 0.5)$, consider the inner rectangle $X_\delta \subset X_i$:
$$X_\delta = \big( a_1 + \delta(b_1 - a_1),\ b_1 - \delta(b_1 - a_1) \big) \times \cdots \times \big( a_n + \delta(b_n - a_n),\ b_n - \delta(b_n - a_n) \big),$$
where we omit the rectangle subscript $i$ for clarity. Letting $b'_j = b_j - \delta(b_j - a_j)$, define the function:
$$T(\bar{f}) = \prod_{j=1}^n \frac{1}{\delta} \Big( \mathrm{ReLU}\big( \delta - \mathrm{ReLU}(\bar{f}_j - b'_j) \big) - \mathrm{ReLU}\big( \delta - \mathrm{ReLU}(\bar{f}_j - a_j) \big) \Big),$$
where $\mathrm{ReLU}(x) = \max(x, 0)$. $T$ is a piecewise linear function that ramps from 0 at the boundary of $X_i$ to 1 within $X_\delta$, and vanishes outside $X_i$. Note that
$$\|1_X - T\|_1 < \mu(X) - \mu(X_\delta) = \big( 1 - (1 - 2\delta)^n \big) \mu(X), \qquad (89\text{–}90)$$
if $\mu$ is the Lebesgue measure; $\delta$ may need to be smaller under other measures, but this adjustment is independent of the input $\bar{f}$, so it can be specified a priori. Recall that the function we want to approximate is $Y(\bar{f}) = \sum_{i=1}^{|\mathcal{R}|} y_i 1_{X_i}(\bar{f})$. We can build DI-Net layers $K : \bar{f} \mapsto \sum_{i=1}^{|\mathcal{R}|} y_i T_i(\bar{f})$, since this only involves linear combinations and ReLUs. Then,
$$\|K - Y\|_1 = \int_{\mathbb{R}^n} \Big| \sum_{i=1}^{|\mathcal{R}|} y_i \big( T_i(\bar{f}) - 1_{X_i}(\bar{f}) \big) \Big|\, d\bar{f} \le \sum_{i=1}^{|\mathcal{R}|} y_i \|1_{X_i} - T_i\|_1 \qquad (91\text{–}92)$$
$$< \big( 1 - (1 - 2\delta)^n \big) \sum_{i=1}^{|\mathcal{R}|} y_i \mu(X_i) = \big( 1 - (1 - 2\delta)^n \big) \sum_{i=1}^{|\mathcal{R}|} \mu(R_i) < \big( 1 - (1 - 2\delta)^n \big) \Big( \|J\|_1 + \frac{\epsilon}{8} \Big), \qquad (93\text{–}95)$$
by (81). So by choosing:
$$\delta = \frac{1}{2} \left( 1 - \Big( 1 - \frac{\epsilon/4}{\|J\|_1 + \epsilon/8} \Big)^{1/n} \right), \qquad (96)$$
we have our desired bound $\|K - Y\|_1 < \frac{\epsilon}{4}$ and thereby $\|J - K\|_1 < \frac{\epsilon}{2}$.

Case 2: $J$ is negative on some region of $\mathbb{R}^n$. Letting $J^+(\bar{f}) = \max(0, J(\bar{f}))$ and $J^-(\bar{f}) = \max(0, -J(\bar{f}))$, define:
$$G^+_{E,J} \triangleq \{(\bar{f}, y) : \bar{f} \in E,\ y \in [0, J^+(\bar{f})]\}, \qquad G^-_{E,J} \triangleq \{(\bar{f}, y) : \bar{f} \in E,\ y \in [0, J^-(\bar{f})]\}. \qquad (97\text{–}98)$$
As in (80), construct covers of rectangles $\mathcal{R}^+$ over $G^+_{E,J}$ and $\mathcal{R}^-$ over $G^-_{E,J}$, each with bound $\frac{\epsilon}{16}$ and $\mathbb{R}^n$ projections $X^+$, $X^-$.
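The ReLU ramp $T$ above is easy to realize numerically. The following sketch implements a per-dimension ramp with an absolute width $w_k = \delta(b_k - a_k)$ (our own convention for the ramp width; the construction itself follows the formula above):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Piecewise-linear approximation of the indicator of a rectangle [a, b] in R^n
# built from ReLUs: 1 on the inner rectangle, 0 outside, linear ramps between.
def ramp_indicator(f, a, b, delta=0.05):
    a, b = np.asarray(a, float), np.asarray(b, float)
    w = delta * (b - a)              # ramp width per dimension (an assumption)
    bp = b - w                       # inner edge b' = b - delta * (b - a)
    per_dim = (relu(w - relu(f - bp)) - relu(w - relu(f - a))) / w
    return per_dim.prod()            # product over dimensions, as in T

a, b = [0.2, 0.2], [0.8, 0.8]
inside = ramp_indicator(np.array([0.5, 0.5]), a, b)    # deep inside: 1.0
outside = ramp_indicator(np.array([0.9, 0.1]), a, b)   # outside: 0.0
```

Shrinking `delta` tightens the $L^1$ gap between the ramp and the true indicator, exactly as the $(1 - (1-2\delta)^n)\mu(X)$ bound describes.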
Let:
$$Y^+(\bar{f}) = \sum_{i=1}^{|\mathcal{R}^+|} y_i^+ 1_{X_i^+}(\bar{f}), \qquad Y^-(\bar{f}) = \sum_{i=1}^{|\mathcal{R}^-|} y_i^- 1_{X_i^-}(\bar{f}), \qquad Y = Y^+ - Y^-. \qquad (99\text{–}101)$$
We can derive an expression equivalent to (84):
$$\frac{\epsilon}{8} > \int_E \Big| J(\bar{f}) - \Big( \sum_{i=1}^{|\mathcal{R}^+|} y_i^+ 1_{X_i^+}(\bar{f}) - \sum_{i=1}^{|\mathcal{R}^-|} y_i^- 1_{X_i^-}(\bar{f}) \Big) \Big|\, d\bar{f} = \|1_E J - Y\|_1. \qquad (102\text{–}103)$$
Similarly to before, we use (78) and (103) to get:
$$\int_{\mathbb{R}^n} |J(\bar{f}) - K(\bar{f})|\, d\bar{f} < \frac{\epsilon}{4} + \|K - Y\|_1. \qquad (104)$$
Choosing $T^+_i$ and $T^-_i$ as the piecewise linear functions associated with $X^+_i$ and $X^-_i$, and
$$K(\bar{f}) = \sum_{i=1}^{|\mathcal{R}^+|} y_i^+ T_i^+(\bar{f}) - \sum_{i=1}^{|\mathcal{R}^-|} y_i^- T_i^-(\bar{f}), \qquad (105)$$
we have:
$$\|K - Y\|_1 = \int_{\mathbb{R}^n} \Big| \sum_{i=1}^{|\mathcal{R}^+|} y_i^+ \big( T_i^+(\bar{f}) - 1_{X_i^+}(\bar{f}) \big) - \sum_{i=1}^{|\mathcal{R}^-|} y_i^- \big( T_i^-(\bar{f}) - 1_{X_i^-}(\bar{f}) \big) \Big|\, d\bar{f}, \qquad (106)$$
and, applying the triangle inequality,
$$\le \sum_{i=1}^{|\mathcal{R}^+|} y_i^+ \big\| 1_{X_i^+} - T_i^+ \big\|_1 + \sum_{i=1}^{|\mathcal{R}^-|} y_i^- \big\| 1_{X_i^-} - T_i^- \big\|_1 \qquad (107)$$
$$< \big( 1 - (1 - 2\delta^+)^n \big) \sum_i y_i^+ \mu(X_i^+) + \big( 1 - (1 - 2\delta^-)^n \big) \sum_i y_i^- \mu(X_i^-) \qquad (108)$$
$$< \big( 1 - (1 - 2\delta^+)^n \big) \Big( \|J^+\|_1 + \frac{\epsilon}{16} \Big) + \big( 1 - (1 - 2\delta^-)^n \big) \Big( \|J^-\|_1 + \frac{\epsilon}{16} \Big). \qquad (109)$$
By choosing:
$$\delta^+ = \frac{1}{2} \left( 1 - \Big( 1 - \frac{\epsilon/8}{\|J^+\|_1 + \epsilon/16} \Big)^{1/n} \right), \qquad \delta^- = \frac{1}{2} \left( 1 - \Big( 1 - \frac{\epsilon/8}{\|J^-\|_1 + \epsilon/16} \Big)^{1/n} \right), \qquad (110\text{–}111)$$
and proceeding as before, we arrive at the same bounds $\|K - Y\|_1 < \frac{\epsilon}{4}$ and $\|J - K\|_1 < \frac{\epsilon}{2}$.

Putting it all together, Algorithm 1 implements the network logic for producing the function $K$.

Algorithm 1: DI-Net approximation of $\bar{f} \mapsto J(\bar{f})$
Setup;
Input: target function $J$, $L^1$ tolerance $\epsilon/2$
Choose rectangles $R^+_i = [a^+_{i1}, b^+_{i1}] \times \cdots$
$\times [a^+_{in}, b^+_{in}] \times [\zeta^+_i, \zeta^+_i + y^+_i]$ satisfying (80), and $R^-_i$ similarly;
$\delta^+ \leftarrow \frac{1}{2}\big(1 - (1 - \frac{\epsilon}{8}(\|J^+\|_1 + \frac{\epsilon}{16})^{-1})^{1/n}\big)$;
$\delta^- \leftarrow \frac{1}{2}\big(1 - (1 - \frac{\epsilon}{8}(\|J^-\|_1 + \frac{\epsilon}{16})^{-1})^{1/n}\big)$;
Inference;
Input: discretized input $\bar{f} = \{\bar{f}_k\}_{k=1}^n$
$x \leftarrow (0, 0, 1, 0, 0)$;
for rectangle $R^+_i \in \mathcal{R}^+$ do
  for dimension $k \in 1:n$ do
    $x \leftarrow (\bar{f}_k - b^+_{ik} + \delta^+(b^+_{ik} - a^+_{ik}),\ \bar{f}_k - a^+_{ik},\ x_3, x_4, x_5)$;
    $x \leftarrow \mathrm{ReLU}(x)$;
    $x \leftarrow (\delta^+ - x_1,\ \delta^+ - x_2,\ x_3, x_4, x_5)$;
    $x \leftarrow \mathrm{ReLU}(x)$;
    $x \leftarrow (0, 0,\ x_3(x_1 - x_2)/\delta^+,\ x_4, x_5)$;
  end
  $x \leftarrow (0, 0, 1,\ y^+_i x_3 + x_4,\ x_5)$;
end
for rectangle $R^-_i \in \mathcal{R}^-$ do
  for dimension $k \in 1:n$ do
    $x \leftarrow (\bar{f}_k - b^-_{ik} + \delta^-(b^-_{ik} - a^-_{ik}),\ \bar{f}_k - a^-_{ik},\ x_3, x_4, x_5)$;
    ...;
  end
  $x \leftarrow (0, 0, 1,\ x_4,\ y^-_i x_3 + x_5)$;
end
Output: $x_4 - x_5$

We can provide $x$ with access to $\bar{f}$ either through skip connections or by appending channels with the values $\{c + \bar{f}_k\}_{k=1}^n$ (which are preserved under ReLU).

Theorem 3 (Maps between single-channel NFs). For any Lipschitz continuous map $R : F_1 \to F_1$, any $\epsilon > 0$, and any finite measure $\nu$ w.r.t. the measurable space $(F_1, \mathcal{A}_{\epsilon,R})$, there exists a DI-Net $T$ that satisfies:
$$\int_{F_1} \|R(f) - T(f)\|_{L^1(\Omega)}\, \nu(df) < \epsilon.$$

Proof. If $\nu$ is not normalized, the discrepancy of our point set needs to be further divided by $\max\{\nu(F_1), 1\}$; we assume for the remainder of this section that $\nu$ is normalized. Perform the construction of Lemma 5 $N$ times, each with a tolerance of $\epsilon / 2NK$, where $K$ is the Lipschitz constant of $R$. Choose a partition of unity $\{\psi_j\}_{j=1}^N$ for which $\psi_j(x) = 1[x_j = \arg\min_{x' \in X} d(x, x')]$, and output $N$ channels with the values $\{K_j(\bar{f}) \psi_j(\cdot)\}_{j=1}^N$. By summing these channels we obtain a network $\bar{K}$ that fully specifies the desired behavior of $\bar{R} : \mathbb{R}^N \to \mathbb{R}^N$, with combined error:
$$\int_{\mathbb{R}^N} \big\| \bar{R}(\bar{f}) - \bar{K}(\bar{f}) \big\|_{\ell^1}\, \mu(d\bar{f}) < \frac{\epsilon}{2}. \qquad (113)$$
Thus,
$$\int_{F_1} \frac{1}{|X|} \sum_{x' \in X} \big| \bar{R} \circ \pi[f](x') - \bar{K} \circ \pi[f](x') \big|\, \nu(df) \le \frac{\epsilon}{2}. \qquad (114)$$
By (66) we have:
$$\int_{F_1} \int_\Omega \big| \pi^{-1} \circ \bar{R} \circ \pi[f](x) - \pi^{-1} \circ \bar{K} \circ \pi[f](x) \big|\, dx\, \nu(df) \le \frac{\epsilon}{2} + \frac{\epsilon}{6(M+2)}.$$
By Lemma 3 we have:
$$\int_{F_1} \int_\Omega \big| R[f](x) - \pi^{-1} \circ \bar{K} \circ \pi[f](x) \big|\, dx\, \nu(df) \le \frac{\epsilon}{2} + \frac{\epsilon}{6(M+2)} + \frac{\epsilon}{6}.$$
And thus the network $T = \pi^{-1} \circ \bar{K} \circ \pi$ gives us the desired bound:
$$\int_{F_1} \|R(f) - T(f)\|_{L^1(\Omega)}\, \nu(df) < \epsilon.$$

Corollary 2 (Maps from NFs to vectors). For any Lipschitz continuous map $R : F_1 \to \mathbb{R}^n$, any $\epsilon > 0$, and any finite measure $\nu$ w.r.t. the measurable space $(F_1, \mathcal{A}_{\epsilon,R})$, there exists a DI-Net $T$ that satisfies:
$$\int_{F_1} \|R(f) - T(f)\|_{\ell^1(\mathbb{R}^n)}\, \nu(df) < \epsilon.$$

Proof. Let $M_0$ be the Lipschitz constant of $R$ in the sense that $d(R[f], R[g])_{\ell^1} \le M_0\, d(f, g)_{L^1}$. Let $M = \max\{M_0, 1\}$. There exists $\bar{R} : \pi F_1 \to \mathbb{R}^n$ such that $\| \bar{R} \circ \pi[f] - R[f] \|_{\ell^1} \le \epsilon / 12$. As in Lemma 4, consider the extension of $\bar{R}$ to $\mathbb{R}^N \to \mathbb{R}^n$ in which each component of the output has the form:
$$\bar{R}_j(\bar{f}) = \begin{cases} \bar{R}[\pi^{-1} \bar{f}]_j & \text{if } \bar{f} \in \pi F_1 \\ 0 & \text{otherwise.} \end{cases}$$
By similar reasoning, $\nu$ on $F_1$ induces a measure $\mu$ on $\mathbb{R}^N$ that is finite and absolutely continuous w.r.t. the Lebesgue measure, with $\int_{\mathbb{R}^N} |\bar{R}_j(\bar{f})|\, \mu(d\bar{f}) < \infty$ for each $j$.

The multi-channel analogue of Corollary 2 is straightforward, and we state it here for completeness:

Corollary 5 (Maps from multi-channel NFs to vectors). For any Lipschitz continuous map $R : F_n \to \mathbb{R}^m$, any $\epsilon > 0$, and any finite measure $\nu$ w.r.t. the measurable space $(F_n, \mathcal{A}'_{\epsilon,R})$, there exists a DI-Net $T$ that satisfies:
$$\int_{F_n} \|R(f) - T(f)\|_{\ell^1(\mathbb{R}^m)}\, \nu(df) < \epsilon.$$

C PIXEL-BASED DI-NET LAYERS

Here we present a variety of layers that show how to generalize pixel-based networks (convolutional neural networks and vision transformers) to DI-Net equivalents. Many of the following layers were not directly used in our experiments, and we leave an investigation of their properties for future work. We use $c_{in}$ to denote the number of channels of an input NF and $c_{out}$ the number of channels of an output NF.

Convolution layer

The convolution layer aggregates information locally and across channels. It has $c_{in} c_{out}$ learned filters $K_{ij}$, defined on some support $S$ which may be a ball or an orthotope, and learns scalar biases $b_j$ for each output channel:
$$g_j = \sum_{i=1}^{c_{in}} K_{ij} * f_i + b_j,$$
with $*$ the continuous convolution as in (8). To transfer weights from a discrete convolutional layer, $K$ can be parameterized as a rectangular B-spline surface that interpolates the weights (Fig. 2, left). To replicate the behavior of a discrete convolution layer with odd kernel size, $S$ is zero-centered; for even kernel size, we shift $S$ by half the dimensions of a pixel. We use a 2nd-order B-spline for $3 \times 3$ filters and a 3rd-order B-spline for larger filters, evaluating the spline at intermediate points with de Boor's algorithm. Strided convolution is implemented by simply truncating the output discretization to the desired factor, as described in Section 4. The various padding behaviors of the discrete case are treated differently. Zero-padding is replicated by scaling $H[f](x)$ by $\frac{|(S + x) \cap \Omega|}{|S + x|}$, where $S + x$ is the kernel support $S$ translated by $x$. For reflection padding, the values of the NF at points outside its domain are calculated by reflection. For no padding, the NF's domain is reduced accordingly, dropping all sample points that no longer lie in the new domain.

Linear combinations of channels Linear combinations of channels mimic the function of $1 \times 1$ convolutional layers in discrete networks. For learned scalar weights $W_{ij}$ and biases $b_j$:
$$g_j(x) = \sum_{i=1}^{c_{in}} W_{ij} f_i(x) + b_j, \quad \text{for all } x \in \Omega.$$
These weights and biases can be copied directly from a $1 \times 1$ convolutional layer to obtain the same behavior. One can also adopt a normalized version, sometimes used in attention-based networks:
$$W_{ij} = \frac{w_{ij}}{\sum_{k=1}^{c_{in}} w_{kj}}. \qquad (130)$$

Normalization All forms of layer normalization readily generalize to the continuous setting by estimating the statistics of each channel with numerical integration, then applying point-wise operations. These layers typically rescale each channel to have some mean $m_i$ and standard deviation $s_i$:
$$\mu_i = \int_\Omega f_i(x)\,dx, \qquad \sigma_i^2 = \int_\Omega f_i(x)^2\,dx - \mu_i^2, \qquad (132)$$
$$g_i(x) = \frac{f_i(x) - \mu_i}{\sigma_i + \epsilon} \times s_i + m_i,$$
where we assume $dx$ is normalized and $\epsilon > 0$ is a small constant. Just as in the discrete case, $\mu_i$ and $\sigma_i^2$ can be moving averages of the means and variances observed over the course of training on different NFs, and they can also be averaged over a minibatch of NFs (batch normalization) or calculated per datapoint (instance normalization). The mean $m_i$ and standard deviation $s_i$ can be learned directly (batch normalization), conditioned on other data (adaptive instance normalization), or fixed at 0 and 1 respectively (instance normalization). These layers are not discretization invariant in the sense of Definition 1, since the output can be poorly behaved for small values of $\sigma_i$, but the convergence condition still holds; i.e., normalization is convergent under an equidistributed sequence of discretizations.
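As a deliberately simplified sketch of the continuous convolution described at the start of this section, the following evaluates an interpolated kernel at arbitrary offsets and averages over the sample points inside the support; we use bilinear interpolation in place of the B-spline parameterization, and all names are illustrative:

```python
import numpy as np

def eval_kernel(weights, offset, support=1.5):
    """Bilinearly interpolate a (k, k) weight grid at a 2D offset in S."""
    k = weights.shape[0]
    # Map offset in [-support, support]^2 to continuous grid coordinates.
    u = (np.asarray(offset) + support) / (2 * support) * (k - 1)
    i0 = np.clip(np.floor(u).astype(int), 0, k - 2)
    t = u - i0
    w00 = weights[i0[0], i0[1]];     w10 = weights[i0[0] + 1, i0[1]]
    w01 = weights[i0[0], i0[1] + 1]; w11 = weights[i0[0] + 1, i0[1] + 1]
    return ((1 - t[0]) * (1 - t[1]) * w00 + t[0] * (1 - t[1]) * w10
            + (1 - t[0]) * t[1] * w01 + t[0] * t[1] * w11)

def continuous_conv(f_vals, pts, out_pt, weights, support=1.5):
    """(K * f)(out_pt), estimated over sample points within the support S."""
    offsets = pts - out_pt
    mask = np.all(np.abs(offsets) <= support, axis=1)
    kv = np.array([eval_kernel(weights, o, support) for o in offsets[mask]])
    return np.mean(kv * f_vals[mask])

K = np.ones((3, 3))   # a box filter: interpolates to 1 everywhere on S
pts = np.random.default_rng(0).random((512, 2)) * 4 - 2
out = continuous_conv(np.ones(512), pts, np.zeros(2), K)
```

With a constant input field and a box filter, the estimate is exactly 1 regardless of the discretization, which is the simplest check of discretization invariance.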
Max pooling There are two natural generalizations of the max pooling layer to a collection of points: (1) assigning each point the maximum of its $k$ nearest neighbors, and (2) taking the maximum value within a fixed-size window around each point. However, both of these specifications change the output's behavior as the density of points increases. In the first case, nearest neighbors become closer together, so pooling occurs over smaller regions where there is less total variation in the NF. In the second case, the empirical maximum increases monotonically as the NF is sampled more finely within each window. Because we may want to change the number of sample points on the fly, both of these behaviors are detrimental. If we view max pooling as a layer that shuttles gradients through a strong local activation, then it suffices to use a fixed-size window with a scaling factor that mitigates the impact of changing the number of sample points. Consider the following simplistic model: assume each point in a given patch of an NF channel is an i.i.d. sample from $U([-b, b])$. Then the maximum of $N$ samples $\{f_i(x_j)\}_{j=1}^N$ is on average $\frac{N-1}{N+1} b$. So we can achieve an "unbiased" max pooling layer by taking the maximum value observed in each window and scaling it by $\frac{N+1}{N-1}$ (if $N = 1$ or the empirical maximum is negative, we simply return the maximum), then (optionally) multiplying by a constant to match the discrete layer. To replicate the behavior of a discrete max pooling layer with even kernel size, we shift the window by half the dimensions of a pixel, just as in the case of convolution.

Tokenization A tokenization layer chooses a finite set of non-overlapping regions $\omega_j \subset \Omega$ of equal measure such that $\cup_j \omega_j = \Omega$. We apply the indicator function of each set to each channel $f_i$. An embedding of each $f_i|_{\omega_j}$ into $\mathbb{R}^n$ can be obtained by taking its inner product with a polynomial function whose basis spans each $L^2(\omega_j)$.
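The "unbiased" max pooling rule above is a one-liner; the following sketch verifies the toy model numerically (the window size and all names are our own choices):

```python
import numpy as np

# Under the model where the N values in a window are i.i.d. Uniform([-b, b]),
# E[max] = b * (N - 1) / (N + 1), so rescaling the empirical max by
# (N + 1) / (N - 1) recovers b in expectation.
def unbiased_window_max(values):
    n = len(values)
    m = np.max(values)
    if n == 1 or m <= 0:
        return m                     # fall back to the plain maximum
    return m * (n + 1) / (n - 1)

rng = np.random.default_rng(0)
b = 2.0
est = np.mean([unbiased_window_max(rng.uniform(-b, b, size=9))
               for _ in range(20000)])   # close to b, independent of density
```

The point of the correction is that the expected output no longer drifts upward as the window is sampled more densely.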
To replicate a pre-trained embedding matrix, we interpolate the weights with B-spline surfaces.

Average pooling An average pooling layer performs a continuous convolution with a box filter, followed by downsampling. To reproduce a discrete average pooling with even kernels, the box filter is shifted, similarly to max pooling. An adaptive average pooling layer can be replicated by tokenizing the NF and taking the mean of each token to produce a vector of the desired size.

Attention layer There are various ways to replicate the functionality of an attention layer; here we present an approach that preserves the domain. For some $d_k \in \mathbb{N}$, consider a self-attention layer with $c_{in} d_k$ parametric functions $q_{ij} \in L^2(\Omega)$, $c_{in} d_k$ parametric functions $k_{ij} \in L^2(\Omega)$, and a convolution with $d_k$ output channels; produce the output NF $g$ as:
$$Q_j = \sum_{i=1}^{c_{in}} \langle q_{ij}, f_i \rangle \qquad (134)$$
$$K_j = \sum_{i=1}^{c_{in}} \langle k_{ij}, f_i \rangle \qquad (135)$$
$$V[f]_j = \sum_{i=1}^{c_{in}} v_{ij} * f_i + b_j \qquad (136)$$
$$g(x) = \mathrm{softmax}\Big( \frac{Q K^T}{\sqrt{d_k}} \Big) V[f](x).$$
A cross-attention layer generates queries from a second input NF. A multi-head attention layer generates several sets of $(Q, K, V)$ triplets and takes the softmax of each set separately.

Data augmentation Most data augmentation techniques, including spatial transformations, point-wise functions and normalizations, translate naturally to NFs. Furthermore, spatial transformations are efficient and do not incur the usual cost of interpolating back to the grid. Thus DI-Nets might be suitable for a new set of data augmentation methods, such as adding Gaussian noise to the discretization coordinates.

Positional encoding Given their central role in neural fields, positional encodings (adding sinusoidal functions of the coordinates to each channel) would likely play an important role in helping pixel-based DI-Nets capture high-frequency information under a range of discretizations.

Vector decoders ($\mathbb{R}^n \to F_c$) and parametric functions in $F_c$ A vector can be expanded into an NF in several ways.
We can create an NF that simply places the input values at fixed coordinates and produces values at all other coordinates by interpolation. Alternatively, we can define a parametric function that spans $F_c$ using the input vector as its parameters, for example by taking $n$ numbers as input and treating them as coefficients of the first $n$ elements of an orthonormal polynomial basis on $\Omega$. If $\Omega$ is a subset of $[a, b]^d$, one can use a separable basis defined by the product of rescaled 1D Legendre polynomials along each dimension; if $\Omega$ is a $d$-ball, we can use the Zernike polynomial basis. For a general coordinate system, a small MLP can be used, where $\mathbb{R}^n$ represents its parameters or a learned lower-dimensional modulation (Dupont et al., 2022) of its parameters. Beyond serving as vector decoder layers, such parametric functions also give rise to $n$-parameter layers that compute an inner product ("learned global pooling layer") or elementwise product ("dense modulation layer") of an input NF with the learned functions.

Warp layer Layers that apply a self-homeomorphism $q$ of $\Omega$ (a bicontinuous map $\Omega \to \Omega$) preserve discretization invariance, since $q$ simply modifies the upper bound on the invariance error in subsequent layers to use the discrepancy of $q(X)$ rather than that of $X$.
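The 1D version of the Legendre decoder mentioned above can be sketched as follows (the rescaling to an orthonormal basis on $[0,1]$ is standard; the function names are ours):

```python
import numpy as np
from numpy.polynomial import legendre

# Vector-decoder sketch: treat the input vector as coefficients of the first n
# Legendre polynomials, rescaled to be orthonormal on Omega = [0, 1].
def decode(coeffs, x):
    """Evaluate the NF sum_k c_k * sqrt(2k+1) * P_k(2x - 1) at points x."""
    t = 2.0 * np.asarray(x) - 1.0                     # map [0,1] -> [-1,1]
    scale = np.sqrt(2 * np.arange(len(coeffs)) + 1)   # orthonormal on [0,1]
    return legendre.legval(t, np.asarray(coeffs) * scale)

# Orthonormality check: ∫_0^1 decode(e_j) * decode(e_k) dx = delta_jk,
# estimated with a midpoint discretization.
x = (np.arange(4096) + 0.5) / 4096
g1 = decode([0, 1, 0], x)
g2 = decode([0, 0, 1], x)
ip = np.mean(g1 * g2)        # cross term, close to 0
norm = np.mean(g1 * g1)      # self term, close to 1
```

In $d$ dimensions the same idea applies with products of such 1D polynomials along each axis.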

D EXPERIMENTAL DETAILS

for each training iteration do
  Input NFs $f_i$, point labels $(x_{ij}, y_{ij}) \leftarrow$ minibatch($D$);
  Output NFs $g_i \leftarrow T_\theta[f_i]$;
  Point label estimates $\hat{y}_{ij} \leftarrow g_i(x_{ij})$;
  Update $\theta$ based on $\nabla_\theta L(\hat{y}_{ij}, y_{ij})$;
end
Output: trained network $T_\theta$

D.2 DETAILS ON IMAGENET CLASSIFICATION

We split ImageNet1k into 12 superclasses (dog, structure/construction, bird, clothing, wheeled vehicle, reptile, carnivore, insect, musical instrument, food, furniture, primate) based on the big 12 dataset (Engstrom et al., 2019), which is in turn derived from the WordNet hierarchy. The hypernetwork learns a map from the RGB SIREN to a SIREN with the same architecture that represents the segmentation. It predicts changes to the weights of all layers before the final fully connected layer, and predicts raw values for the weights of the final layer, since it has 7 output channels for segmentation instead of 3 for RGB. The non-uniform CNN applies the non-uniform Fourier transform followed by an inverse fast Fourier transform, and feeds the result to the 3-layer FCN to perform segmentation.
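The training loop sketched in the algorithm above reduces, in its simplest form, to querying the output NF at the labeled coordinates and descending the point-wise loss. A minimal runnable toy (the "network" here is a single scalar parameter, purely for illustration):

```python
import numpy as np

# Toy version of the training loop: sample query coordinates, evaluate the
# input NF there, query the output NF at the same points, update parameters.
# The network is g(x) = theta * f(x), a hypothetical one-parameter DI layer.
rng = np.random.default_rng(0)
theta = 0.0
lr = 0.5
for step in range(200):
    x = rng.random(64)                  # minibatch of query coordinates
    f_vals = np.sin(2 * np.pi * x)      # input NF sampled at those points
    y = 2.0 * f_vals                    # ground-truth point labels
    y_hat = theta * f_vals              # output NF queried at the labels
    grad = np.mean(2 * (y_hat - y) * f_vals)   # gradient of the MSE loss
    theta -= lr * grad
# theta converges to 2.0
```

Everything downstream of the point queries is ordinary stochastic gradient descent; the only NF-specific step is evaluating $g_i$ at the label coordinates.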

D.4 SIGNED DISTANCE FUNCTIONS

In our SDF prediction experiment, we construct toy scenes of 2–4 balls of random radii (range 0.2–0.5), centers, and colors scattered in 3D space ($\Omega = [-1, 1]^3$). For simplicity, we train each network directly on the closed-form expressions for the RGBA fields and signed distance functions, rather than fitting neural fields first. The FCN contains 3 convolutional layers of kernel lengths 3, 5, and 1 respectively; accordingly, the convolutional DI-Net contains 2 convolutional layers followed by a linear combination layer. There are 8 channels in all intermediate features. We train each network for 1000 iterations with the AdamW optimizer, a batch size of 64, a learning rate of 0.1, and an MSE loss on the SDF.
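The closed-form SDF of such a scene is the pointwise minimum of the per-ball signed distances. A sketch of the scene generator (parameter ranges follow the text; the function names are ours):

```python
import numpy as np

# Signed distance of a union of balls: min over balls of (|x - c_i| - r_i).
def scene_sdf(points, centers, radii):
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    return (d - radii[None, :]).min(axis=1)

rng = np.random.default_rng(0)
n_balls = rng.integers(2, 5)                      # 2-4 balls
centers = rng.uniform(-1, 1, size=(n_balls, 3))
radii = rng.uniform(0.2, 0.5, size=n_balls)

q = rng.uniform(-1, 1, size=(1024, 3))            # query points in [-1, 1]^3
sdf = scene_sdf(q, centers, radii)
# At the center of ball i, the scene SDF is at most -radii[i].
```

Because the expression is exact, the network can be supervised at arbitrary query points without ever rasterizing the scene.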

E ADDITIONAL ANALYSIS

E.1 INITIALIZATION WITH DISCRETE NETWORKS

When a DI-Net is initialized with a large pre-trained convolutional neural network, its outputs on the original grid discretization are identical to the CNN's by construction. However, the behavior of the pre-trained CNN is not preserved when the DI-Net switches to other discretizations: even tiny perturbations from the regular grid are sufficient to change a classifier's predictions. Although the effect on the output of a single layer is much smaller than the signal, small differences in each layer accumulate to exert a large influence on the final output. Figure E.2 illustrates this phenomenon for a DI-Net initialized with a truncated EfficientNet. In addition, we find that once the grid discretization is abandoned, large DI-Nets cannot easily be fine-tuned to restore the behavior of the discrete network used to initialize them. This suggests not only that the discretization used at training time does not necessarily permit new discretizations at inference time, but also that the optimization landscape of maps on L^2/I_X → R can vary significantly with X.

Model Type                     Mean IoU    Pixel Accuracy
ConvNexT (Liu et al., 2022)    0.429       68.1%
DI-Net-CN                      0.376       68.7%

We also find that the output of an NF-Net is less stable under changing sampling resolution with a grid pattern (Fig. E.1). While the output of a network with QMC sampling converges at high resolution, the grid sampling scheme produces unstable outputs until very high resolution; only grids that overlap each other (resolutions in powers of two) produce similar activations. Our preliminary experience with DI-Nets highlights the need for improved sampling schemes and parameterizations that will allow large continuous-domain neural networks to learn effectively. Stable, scalable methods are needed to realize DI-Nets' full potential for continuous data analysis.

E.2 COMPUTATIONAL COMPLEXITY

The DI-Net's complexity is similar to that of a discrete model with an equivalent architecture. In general, time and memory both scale linearly with the number of sample points (regardless of the dimensionality of Ω), as well as with network depth and width. Implemented naively, the computational cost of the continuous convolution is quadratic in the number of sample points, as it must calculate a weight separately for each neighboring pair of points. We can reduce this to linear cost by specifying a Voronoi partition of the kernel support B into N_bin cells, then using the value of the kernel at each seed point for all points in its cell. Thus the kernel need only be evaluated N_bin times regardless of the number of sample points, and N_bin can be modified during training and inference. DI-Net-4 (our ImageNet classifier) performs a forward pass on a batch of 48 images in 96 ± 4 ms on a single NVIDIA RTX 2080 Ti GPU.
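The Voronoi binning trick can be sketched as follows (a simplified sketch: `binned_kernel_weights` is an illustrative name, a plain nearest-seed lookup stands in for the Voronoi cell query, and a toy function stands in for the kernel MLP):

```python
import numpy as np

def binned_kernel_weights(offsets, kernel_fn, seeds):
    """Evaluate a continuous kernel at N_bin seed points only, then look up
    each pairwise offset's weight from its nearest seed (its Voronoi cell).

    offsets:   (M, d) array of x_j - x_i offsets inside the kernel support B
    kernel_fn: maps (N_bin, d) -> (N_bin,) weights; stands in for the kernel MLP
    seeds:     (N_bin, d) Voronoi seed points covering the support
    """
    seed_weights = kernel_fn(seeds)             # only N_bin kernel evaluations
    # nearest-seed assignment == Voronoi cell membership
    d2 = ((offsets[:, None, :] - seeds[None, :, :]) ** 2).sum(-1)
    return seed_weights[d2.argmin(axis=1)]      # (M,) weights
```

The expensive kernel network is thus called N_bin times per layer rather than once per pair of points; the remaining nearest-neighbor lookup is cheap and could be further accelerated with a KD-tree.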

F FUTURE DIRECTIONS

Scaling convolutional DI-Nets Our initial experiments suggest that convolutional DI-Nets do not scale well in depth. We suspect that within a CNN-like architecture, discrete convolutional layers have much smoother optimization landscapes with respect to their kernel parameters than continuous convolutional layers parameterized by MLPs or polynomial-basis coefficients, especially in large networks. It is then no surprise that existing implementations of neural networks with continuous convolutions do not simply substitute for the convolutional layers in a standard CNN architecture, but also make use of a variety of additional techniques (Qi et al., 2017; Wang et al., 2021; Boulch, 2019), which would likely be helpful for scaling convolutional DI-Nets.

Parameterization of output NFs

In this work we assume that a DI-Net producing an NF specifies the output discretization a priori, but some applications may need the output to be sampled several times at different discretizations. Re-evaluating the entire network in such cases is inefficient, and we propose two solutions for future work. One method stores the last few layers of the network alongside the input activation, and adapts the discretization as needed in these last few layers only. A second approach treats the discretized outputs of DI-Net as parameters of the output NF in the manner of Vora et al. (2021), which would maintain interoperability of the entire framework.

Extending DI-Net to high discrepancy sequences In many applications, large regions of the domain are less informative for the task of interest. For example, most of the information in 3D scenes is concentrated at object surfaces, so DI-Nets should not need to process a NeRF by densely sampling all 5 dimensions. Moreover, ground truth labels for dense prediction tasks may only be available along a high discrepancy discretization. Such a discretization can be handled by quadrature, but more work is required to design efficient quadrature methods within a neural network. Additional techniques such as learned coordinate transformations or learned discretizations may also help extend our model to extreme discretizations or highly non-uniform measures.

Error propagation When an NF does not faithfully represent the underlying data, it is important to characterize the influence on DI-Net's output. In the worst case, these deviations are adversarial examples, and robustness techniques for discrete networks can also be applied to DI-Net. But what can we say about typical deviations of NFs? Future work should analyze patterns in the mistakes that different types of NFs make, and how to make DI-Nets robust to them.
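The second solution, treating discretized outputs as parameters of an output NF, can be sketched with a simple interpolation-based parameterization (one illustrative choice only; Vora et al. (2021) use their own parameterization, and `nf_from_samples` is a hypothetical helper):

```python
import numpy as np

def nf_from_samples(coords, values, eps=1e-8):
    """Turn DI-Net's discretized outputs into a continuously queryable field,
    here via inverse-distance-weighted interpolation over the output samples.

    coords: (N, d) output discretization; values: (N, c) network outputs.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)

    def field(x):
        # weight each stored sample by inverse squared distance to the query
        d2 = ((np.asarray(x)[None, :] - coords) ** 2).sum(-1)
        w = 1.0 / (d2 + eps)
        return (w[:, None] * values).sum(0) / w.sum()

    return field
```

The network then runs once per input, and the resulting field can be sampled at arbitrary new discretizations without re-evaluating any layers.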



is bounded by the product of V(f_θ) and the discrepancy.¹

¹ The parameterization of multi-layer perceptrons guarantees that NFs are square integrable and of bounded variation over a compact domain. We assume that neural field evaluation yields pointwise values, i.e., the point spread function of the underlying signal is a delta function; non-trivial point spread functions can be accommodated, but this is beyond the scope of this work. Note that the action of the layer produces a neural field with parameters (θ, φ); in Appendix F, we discuss how to reparameterize NFs formed by a sequence of DI layers to control their parameter size. Any equidistributed sequence of points generates an equidistributed discretization sequence by truncating to the first N terms, although the class of all equidistributed discretization sequences is much larger than this. Quasi-Monte Carlo sampling can efficiently generate equidistributed sequences on a wide range of domains.



Figure 2: Convolutional DI-Nets generalize convolutional neural networks to arbitrary discretizations of the domain. Low discrepancy point sets used in quasi-Monte Carlo integration are amenable to the multi-scale structures often found in discrete networks. Convolutional DI-Nets may be initialized directly from pre-trained CNNs.

Figure 4: Cityscapes NF segmentations for models trained on coarse segmentations only. NF-Net produces NF segmentations, which can be evaluated at the subpixel level.

(see Fig. A.1 for examples in 2D), as opposed to the i.i.d. point sets generated by standard Monte Carlo, which generally have high discrepancy. Because the Koksma-Hlawka inequality is sharp, when estimating the integral of a BVHK function on [0, 1]^d, the error of the QMC approximation decays as O((ln N)^d / N), in contrast to the error of the standard Monte Carlo approximation, which decays as O(N^{-1/2}) (Caflisch, 1998).
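A minimal, self-contained illustration of this gap, using a hand-rolled 2D Halton sequence (radical-inverse sequences in bases 2 and 3) rather than the paper's sampler, to integrate the smooth function f(x, y) = xy over [0, 1]² (true value 1/4):

```python
import numpy as np

def van_der_corput(n, base):
    """First n terms of the radical-inverse (van der Corput) sequence."""
    seq = np.zeros(n)
    for i in range(n):
        f, k, x = 1.0, i + 1, 0.0
        while k > 0:
            f /= base
            x += f * (k % base)  # reflect base-`base` digits about the point
            k //= base
        seq[i] = x
    return seq

def halton_2d(n):
    """Low discrepancy 2D Halton point set (bases 2 and 3)."""
    return np.stack([van_der_corput(n, 2), van_der_corput(n, 3)], axis=1)

n = 1024
qmc = halton_2d(n)                                   # low discrepancy points
mc = np.random.default_rng(0).uniform(size=(n, 2))   # i.i.d. points
err_qmc = abs((qmc[:, 0] * qmc[:, 1]).mean() - 0.25)
err_mc = abs((mc[:, 0] * mc[:, 1]).mean() - 0.25)
```

For this smooth integrand the QMC error at N = 1024 is already well below the O(N^{-1/2}) scale of the Monte Carlo estimate, consistent with the Koksma-Hlawka bound.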

Figure A.1: Examples of low and high discrepancy sequences in 2D.

However, BVHK is a rather restrictive class of functions defined on [0, 1]^d that excludes all functions with discontinuities. Brandolini et al. (2013) extended the Koksma-Hlawka inequality to two classes of functions defined below:

Figure D.2: 2D slices of two toy 3D scenes with signed distance functions predicted by DI-Net and a fully convolutional network.

In Tables E.1 and E.2, we illustrate that a DI-Net initialized with a large pre-trained discrete network does not match the performance of the original model when fine-tuned with QMC sampling. We use a truncated version of EfficientNet (Tan & Le, 2019) for classification, fine-tuning on 200 samples per class; for segmentation we use a truncated version of ConvNexT-UPerNet (Liu et al., 2022), fine-tuning on 1000 samples.

Figure E.1: Distance of the output of a DI-Net from its grid output at 32 × 32 resolution, when sampling at various resolutions. Its outputs deviate rapidly as the discretization shifts from a regular grid to a low discrepancy sequence.

Figure E.2: A DI-Net's output diverges as sample points are gradually shifted from a grid layout to a low discrepancy sequence.

Accuracy of 2-layer DI-Net under various discretizations.

Segmentation performance on NFs fit to Cityscapes images (trained on coarse segs).

Mean squared error (×10⁻²) of predicted SDFs under different discretizations. Top 3 settings bolded. MC = Monte Carlo.

These observations illustrate the complex, task-dependent interplay between the type of discretizations observed at training time and the ability of the model to generalize to new discretizations.

Algorithm 3: Dense Prediction Training
Input: network T_θ, dataset D with dense coordinate-label pairs, task-specific loss L
for step s ∈ 1 : N_steps do
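The dense prediction training loop of Algorithm 3 can be sketched end to end with toy stand-ins (all names hypothetical: the real T_θ is a DI-Net, here replaced by a pointwise scaling so the loop runs; the labels encode the target map "double the field"):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.0])  # toy network parameter: T_theta[f](x) = theta * f(x)

def minibatch(batch_size=4, n_pts=64):
    # each NF f_i is a callable field; labels y_ij = 2 * f_i(x_ij)
    fs = [lambda x, a=rng.uniform(0.5, 2.0): a * np.sin(3 * x)
          for _ in range(batch_size)]
    xs = rng.uniform(-1, 1, size=(batch_size, n_pts))    # coordinates x_ij
    ys = np.stack([2.0 * f(x) for f, x in zip(fs, xs)])  # point labels y_ij
    return fs, xs, ys

lr = 0.1
for step in range(200):
    fs, xs, ys = minibatch()
    feats = np.stack([f(x) for f, x in zip(fs, xs)])
    preds = theta[0] * feats                    # y_hat_ij = g_i(x_ij)
    grad = 2.0 * ((preds - ys) * feats).mean()  # d/dtheta of the MSE loss
    theta = theta - lr * np.array([grad])       # update theta
```

Since the target map is exactly realizable by the toy network, the loop drives theta toward 2; a real DI-Net would replace the scalar with layer parameters and the hand derivative with autodiff.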

Table E.1: Pre-trained models fine-tuned on ImageNet NF classification.

Table E.2: Pre-trained models fine-tuned on Cityscapes segmentation.

Appendix

Appendix A provides additional background on the variation of a function and the discrepancy of a point set, as well as more general forms of DI layers. Appendix B provides proofs of the Universal Approximation and Convergent Empirical Gradients theorems, as well as extensions of these properties from the single-channel case stated in the main text to multi-channel maps. Appendix C provides a detailed specification of DI-Net layers that enable DI-Nets to replicate the behavior of grid-based networks. Appendix D provides additional details of the data and architectures used in our experiments. Appendix E provides several analyses, including properties of DI-Nets under different discretizations. Appendix F describes limitations and directions for future work.

A MORE DETAILS ON DISCRETIZATION INVARIANCE

A.1 KOKSMA-HLAWKA INEQUALITY AND LOW DISCREPANCY SEQUENCES

Recall that a function f ∈ L²(Ω) satisfies a Koksma-Hlawka inequality if, for any point set X = {x_1, ..., x_N} ⊂ Ω, the error of the quadrature estimate (1/N) Σ_i f(x_i) relative to ∫_Ω f dµ is bounded by the product of the variation V(f) and the discrepancy of X.

We construct our R^N → R approximation n times with a tolerance of ε/2n, such that:

Applying (65), we find that the network T = K ∘ π gives us the desired bound:

Corollary 3 (Maps from vectors to NFs). For any Lipschitz continuous map R : R^n → F_1 and any ε > 0, there exists a DI-Net T that satisfies:

Proof. Define the map R:

Applying (66), we find that the network T = π⁻¹ ∘ K gives us the desired bound:

Denote the space of multi-channel NFs as F_c, and denote the norm on this space as:

The concatenation of NFs can be defined inductively to yield F_n × F_m → F_{n+m} for any n, m ∈ N. All maps F_n × F_m → F_c can be expressed as a concatenation followed by a map F_{n+m} → F_c. A map R^n → F_m is likewise equivalent to m maps R^n → F_1 followed by concatenation. Thus, we need only characterize the maps that take one multi-channel NF as input.

Considering the maps F_n → F_m, we choose a low discrepancy point set X on Ω such that the Koksma-Hlawka inequality yields a bound of ε/12mn(M + 2). Let π project each component of the input to πF_1, and let π⁻¹ invert this projection under some choice function. We take A′ to be the product σ-algebra generated from this π:

Corollary 4 (Maps between multi-channel NFs). For any Lipschitz continuous map R : F_n → F_m, any ε > 0, and any finite measure ν w.r.t. the measurable space (F_n, A′_{ε,R}), there exists a DI-Net T that satisfies:

Proof. The proof is very similar to that of Theorem 3. Our network now requires nN maps from R^{mN} → R, each with error ε/2mnN. Summing the errors across all input and output channels yields our desired bound.

We fit a SIREN (Sitzmann et al., 2020b) to each image in ImageNet using 5 fully connected layers with 256 channels and sine non-linearities, trained for 2000 steps with an Adam optimizer at a learning rate of 10⁻⁴.
It takes coordinates in [-1, 1]² and produces RGB values in [-1, 1]³. We also fit Gaussian Fourier feature networks (Tancik et al., 2020b). We found that the model's loss curve becomes unstable after 3000 iterations, so we reduce the number of iterations to 2000. The non-uniform CNN applies the non-uniform Fourier transform (Muckley et al., 2020) followed by an inverse Fast Fourier Transform to resample the input signal to the grid, then feeds the result to a 2-layer CNN to perform classification. During training, we augment with noise, horizontal flips, and coordinate perturbations.
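The SIREN described above can be sketched in a few lines (forward pass and initialization only; this is a generic SIREN per Sitzmann et al. (2020b), not the paper's exact fitting code, and the training loop is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_siren(layers, omega0=30.0):
    """SIREN initialization: hidden weights uniform in
    [-sqrt(6/fan_in)/omega0, +sqrt(6/fan_in)/omega0], first layer wider."""
    params = []
    for i, (n_in, n_out) in enumerate(zip(layers[:-1], layers[1:])):
        bound = 1.0 / n_in if i == 0 else np.sqrt(6.0 / n_in) / omega0
        params.append((rng.uniform(-bound, bound, (n_in, n_out)),
                       np.zeros(n_out)))
    return params

def siren(params, x, omega0=30.0):
    """Forward pass: sine non-linearity on every layer except the last."""
    h = x
    for W, b in params[:-1]:
        h = np.sin(omega0 * (h @ W + b))
    W, b = params[-1]
    return h @ W + b  # raw RGB output in R^3

# 5 fully connected layers with 256 channels, coordinates in R^2 -> RGB in R^3
params = init_siren([2, 256, 256, 256, 256, 3])
rgb = siren(params, np.array([[0.1, -0.4]]))
```

Fitting would then minimize the MSE between `siren(params, coords)` and the image's pixel values at those coordinates, e.g. with Adam for 2000 steps as described above.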

D.3 DETAILS ON CITYSCAPES SEGMENTATION

SIREN is trained on Cityscapes images for 2500 steps, using the same architecture and settings as for ImageNet. Seven segmentation classes are used for training and evaluation, labeled in the dataset as 'flat' (e.g. road), 'construction' (e.g. building), 'object' (e.g. pole), 'nature', 'sky', 'human', and 'vehicle'.

DI-Net-3 uses two MLP convolutional layers at the same resolution followed by channel mixing (pointwise convolution), with 16, 32 and 32 channels in the intermediate features. The supports of the kernels in the two MLP convolutional layers are 0.025 × 0.05 and 0.075 × 0.15 respectively, to account for the wide Cityscapes images being remapped to [-1, 1]². DI-Net-5 uses a strided MLP convolution for downsampling and nearest neighbor interpolation for upsampling, with 16 channels in all intermediate features and a residual connection between the higher resolution layers.

