FUNCTIONAL RISK MINIMIZATION

Abstract

In this work, we break the classic assumption of data coming from a single function $f_{\theta^*}(x)$ followed by some noise in output space $P(y \mid f_{\theta^*}(x))$. Instead, we model each data point $(x_i, y_i)$ as coming from its own function $f_{\theta_i}$. We show that this model subsumes Empirical Risk Minimization for many common loss functions and captures more realistic noise processes. We derive Functional Risk Minimization (FRM), a general framework for scalable training objectives that results in better performance in supervised, unsupervised, and reinforcement learning experiments. We also show that FRM can be seen as finding the simplest model that memorizes the training data, providing an avenue towards understanding generalization in the over-parameterized regime.

1. INTRODUCTION

1.1. MOTIVATION

In most machine learning settings, we only have limited control over how data is collected and even less so over the process generating it. For this reason, data is often correlated in complex ways, like data coming from similar times or locations. When these correlations are known, one can handle them appropriately, as is done in frameworks such as multi-task or meta-learning. However, in the absence of obvious reasons to specialize models to subsets of the data, practitioners often take an opposing perspective where differences in labels belonging to similar inputs are regarded as noise, often modeled in the output space. This idea serves as the basis for the training objectives we prefer, e.g., the mean-squared error objective for Gaussian noise or the cross-entropy objective for multinomial distributions. By not accounting for highly structured noise, we implicitly expect a single model to average out noise differences during training.

For instance, consider training a language model on Wikipedia, then fine-tuning it to work on a dataset of books. In doing so, we use two different functions $f_{\theta_{books}}$ and $f_{\theta_{wiki}}$ with $f_{\theta_{books}} \approx f_{\theta_{wiki}}$. In contrast, when we train a model on general internet data, using Wikipedia and the dataset of books, we typically use a single function $f_{\theta_{internet}}$, and we explain each training example with multinomial noise in output space, i.e., $y_i \sim P(\cdot \mid f_{\theta_{internet}}(x_i))$. However, whether we arrange the data into different datasets or a single one, the datapoints remain the same. Therefore, it is contradictory to handle the same variability using two different models: functional diversity vs. output noise.

To remedy this contradiction, this paper proposes to model noise in function space instead of output space. We propose Functional Generative Models (FGMs), where each point $(x_i, y_i)$ comes from its own (unseen) function, $f_{\theta_i}$, which fits it: $y_i = f_{\theta_i}(x_i)$. FGMs don't assume the existence of a privileged function $f_{\theta^*}$, but consider a distribution over functions $P(\theta)$; see figure 1.

Most supervised machine learning is based on variants of empirical risk minimization (ERM), which searches for a single function that best fits the training data. There, the training objective acts in output space, comparing the true answer with the prediction. In contrast, assuming that data comes from an FGM, we derive the Functional Risk Minimization (FRM) framework, where training objectives act in function space. Although the full version requires a high-dimensional integral, we derive a reasonable approximation that scales to training neural networks.

Recently, neural networks have been observed to generalize despite memorizing the data, contradicting the classic understanding of ERM (Zhang et al., 2017). Interestingly, we find a connection between FRM and a recent theory explaining this benign overfitting of over-parameterized neural networks under ERM.

The main contributions of this work are the following:

1. We introduce Functional Generative Models, a simple class of generative models that assigns a function to each datapoint.
2. We derive the Functional Risk Minimization framework, compute a tractable and scalable approximation, and link it to the generalization of over-parameterized neural networks.
3. We provide empirical results showcasing the advantages of FRM in supervised learning, unsupervised learning, and reinforcement learning.

2.1. INFERENCE AND RISK MINIMIZATION

In parametric machine learning, the user specifies a dataset $D = ((x_i, y_i))_{i=1}^n$, a parameterized function class $f_\theta$, and a loss function $L(y, f_\theta(x))$. Our goal is to design a learning framework that provides the $\theta$ that minimizes the expected risk over unseen data: $\min_\theta \mathbb{E}\left[L(y, f_\theta(x))\right]$. However, since we do not have access to unseen data, we cannot compute this expectation.

Empirical risk minimization (ERM)

In machine learning, we often rely on variants of ERM where a loss function $L$ evaluated on the given dataset is optimized, i.e., $\min_\theta \sum_{i=1}^n L(y_i, f_\theta(x_i))$. However, what we want to have is low expected risk (test loss), not empirical risk (training loss). In general, the best choice for a training objective depends on the loss function $L$, but also on the (known) functional class $f_\theta$ and the (unknown) data distribution $P(x, y)$. Often, ERM can be seen as doing maximum likelihood by assuming a very particular noise model for the data that makes $P(y \mid x)$ a function of $P(y \mid f_{\theta^*}(x))$ for some unknown, but fixed, $\theta^*$. However, in general, the user-defined loss function $L$, and thus the optimal $\theta^*$, need not have any relation to the data distribution.

Bayesian learning

The Bayesian setting explicitly disentangles inference of $P(y \mid x)$ from risk minimization of $L$. However, it usually assumes the existence of a true $\theta^*$, and further assumes it comes from some known prior $q$: $\theta^* \sim q(\cdot)$. Then, similar to maximum likelihood, the Bayesian setting often assumes a noise model $P(y \mid f_{\theta^*}(x))$ on the output. Thus inference about the posterior, $P(\theta \mid D) \propto q(\theta) \cdot P(D \mid \theta)$, becomes independent of the loss. Only in the final prediction step is the loss function used, together with the posterior, to find the output with the lowest expected risk.

Relations to FRM

Similar to Bayesian learning, Functional Risk Minimization benefits from a clean distinction between inference and risk minimization. However, FRM assumes fundamental aleatory noise in function space $P(\theta)$, not to be confused with epistemic uncertainty in the Bayesian setting. Similar to ERM, FRM aims at using only a single parameter $\theta^*$ at test time, which avoids the challenging integration required in the Bayesian setting and its corresponding inefficiencies.

2.2. RELATED WORK

FGMs essentially treat each individual point as its own task or distribution. In this way, FGMs are related to multi-task learning (Thrun & Pratt, 1998) and meta-learning (Hospedales et al., 2020). Within them, connections between learning to learn and Hierarchical Bayes are the most relevant (Tenenbaum, 1999; Griffiths et al., 2008; Grant et al., 2018). Implementation-wise, FRM is closer to works looking at distances in parameter space (Nichol et al., 2018) or using implicit gradients (Lorraine et al., 2020; Rajeswaran et al., 2019). However, these are still fundamentally ERM-based, as noise is modeled in output space within each task.

Other works have noted the importance of function space for applications such as minimizing catastrophic forgetting in continual learning (Kirkpatrick et al., 2017), optimization (Martens & Grosse, 2015), or exploration in reinforcement learning (Fortunato et al., 2017). Information geometry (Amari, 2016) formalizes the geometrical structure of distributions using tools from differential geometry. In contrast, we leverage stochasticity in function space for modeling and learning.

Multiple alternatives to ERM have been proposed, particularly in the multi-task setting, such as adaptive (Zhang et al., 2020) and invariant risk minimization (Arjovsky et al., 2019). Also relevant is the line of work aiming at flat minima (Hochreiter & Schmidhuber, 1997) or minimizing sharpness (Foret et al., 2020) in order to improve generalization in standard supervised learning. In contrast to these works, our perturbations are per-point, and they come from the data distribution giving rise to the noise, instead of a regularization made on top of ERM with classic loss functions.

Other works proposed per-point adaptations to tailor a model to each specific input, either to encode an inductive bias (Alet et al., 2020; 2021) or to adapt to a new distribution (Sun et al., 2019; Wang et al., 2020). However, these adaptations fine-tune an imperfect model trained with ERM to get it closer to an ideal model. In contrast, in this work, per-point models are not a mechanism, but a fundamental part of reality, which then defines losses in function space rather than output space.

3.1. DESCRIPTION

In machine learning, we want to reach conclusions about a distribution $P(x, y)$ from a finite dataset $((x_i, y_i))_{i=1}^n$. However, there is no generalization without assumptions. From convolutions to graph neural networks and transformers, most research has focused on finding the right inductive biases for the mappings $x \to y$. However, much less research has challenged the assumptions about the uncertainty of those mappings: $P(y \mid x)$. For instance, whenever we minimize mean-squared error on an image-prediction problem we are doing maximum likelihood assuming Gaussian noise in pixel space. However, the actual noise is usually much more structured, as we show in figure 2.

In this work, we start from a single principle, which we call Functional Generative Models (FGMs): we model each data point $(x_i, y_i)$ as coming from its own function $f_{\theta_i}$ such that $y_i = f_{\theta_i}(x_i)$ and $\theta_i \sim P(\theta)$. Notably, $P(\theta)$ is unknown in the same way that we do not know $P(x, y)$. FGMs can be seen as a special type of hierarchical Bayes (Heskes, 1998; Griffiths et al., 2008), where each group has a single point, the lower level is deterministic, and each $\theta_i$ is an unobserved latent variable.

Example: predicting house prices with linear regression

Let's consider predicting the price of a house given its surface area using a linear regressor, $y = \lambda x + \beta$, and the mean-squared error loss function. ERM would simply find the $\lambda, \beta$ leading to the lowest squared error on the training data. This is equivalent to doing maximum likelihood on a Gaussian noise model $y_i \sim \mathcal{N}(\lambda x_i + \beta, \sigma^2)$ with constant $\sigma$. However, this may be suboptimal. For instance, we intuitively know that prices of bigger houses tend to be higher, but also have larger variances: we expect the price of a large house to vary by 500k, but we would not expect the same 500k variation for a small house. When using the FRM framework, we assume that, for each house $(x_i, y_i)$, there are different $\lambda_i, \beta_i$ satisfying $y_i = \lambda_i x_i + \beta_i$. For instance, we may believe that agent commissions vary and are well modeled by $\beta_i$, and that the price per square meter (captured by $\lambda_i$) changes depending on the neighborhood. This is the modeling made by FGMs, which is more flexible than the output-level noise model corresponding to mean-squared error. We show this effect in figure 3.
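The house-price example can be simulated in a few lines. The sketch below is only illustrative: it assumes independent Gaussian distributions for $\lambda_i$ and $\beta_i$ (the text leaves $P(\theta)$ unspecified), and the function name and all numeric values are hypothetical.

```python
import random
import statistics

def sample_fgm_prices(areas, lam_mean, lam_std, beta_mean, beta_std, seed=0):
    """Draw one price per house, each from its own line y_i = lam_i * x_i + beta_i."""
    rng = random.Random(seed)
    prices = []
    for x in areas:
        lam_i = rng.gauss(lam_mean, lam_std)     # per-house price per square meter
        beta_i = rng.gauss(beta_mean, beta_std)  # per-house fixed cost (e.g. commission)
        prices.append(lam_i * x + beta_i)        # no output-space noise is added
    return prices

# Hypothetical numbers: 5000 small houses (50 m^2) and 5000 large houses (400 m^2).
small = sample_fgm_prices([50.0] * 5000, 4000.0, 500.0, 20000.0, 5000.0, seed=1)
large = sample_fgm_prices([400.0] * 5000, 4000.0, 500.0, 20000.0, 5000.0, seed=2)

std_small = statistics.pstdev(small)
std_large = statistics.pstdev(large)
print(f"price std, small houses: {std_small:.0f}")
print(f"price std, large houses: {std_large:.0f}")
```

Even though no noise is added in output space, the spread of prices grows with the surface area, which a constant-$\sigma$ Gaussian noise model cannot express.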

3.2. PROPERTIES OF FGMS

FGMs model the arbitrariness of dataset definitions

A dataset implicitly defines which points belong to the data distribution $P(x, y)$ and which points do not. For instance, a dataset of houses sold in Boston in the last 5 years doesn't contain houses sold in other cities, or Boston houses sold in 2005. Each of these categories would follow a slightly different distribution and, using Hierarchical Bayes, we could model them as similar parameter assignments to a single function class. More subtly, even the dataset of a single city encompasses multiple distributions, such as houses from different neighborhoods, years, or colors. These hidden intra-distributions are a source of noise when not described in the input. In the absence of any information, the least restrictive assumption is that each point comes from its own distribution, giving rise to what we refer to as noise. It is therefore natural to use Hierarchical Bayes to model the differences in $P(y_i \mid x_i)$ from a single $\theta_i \sim P(\theta)$.

FGMs entrust what the user already trusts

A user needs to provide a learning framework with three ingredients: a dataset $((x_i, y_i))_{i=1}^n$, a function class $f_\theta$, and a loss function $L$. Compared to the Bayesian setting, FGMs don't assume an independent noise model, which may have little connection with the user specifications. Instead, they leverage the user's trust in the function class $f_\theta$ to be a good model of the mapping $x \to y$. They simply go one step further and also entrust the uncertainty in that mapping to the same function class, which now also models individual mappings $x_i \to y_i$.

FGMs encode structure through their function class

FGMs draw their representational power from the function class $f_\theta$. Therefore, if the function class has a particular constraint, the FGM will have a corresponding constraint in probability space. For example, for the function class of linear functions, the expectation of $P(y \mid x)$ is also linear. Similarly, as shown in figure 2, using convolutional neural networks we can create meaningful, structured noise priors in image space. From graph neural networks and neural differential equations to probabilistic programs, FGMs leverage structured function maps to construct structured probability distributions.

FGMs can be arbitrarily expressive

FGMs assume that $P(y \mid x) = P_{\theta \sim P(\theta)}[f_\theta(x) = y]$. As just described, this need not be arbitrarily expressive. However, for some arbitrarily expressive function classes, such as multi-layer perceptrons, their corresponding FGM can be shown to be arbitrarily expressive in probability space. We formalize this in the following definition.

Definition 1. Given a function class $F$ with parameterisation $\Theta$, we define a Functional Generative Model $(P(x), P(\theta)) \in FGM[F_\Theta, \mathcal{X}]$ as a probability density function $P(x, y) \in L^2[\mathcal{X} \times \mathcal{Y}]$ with $x \sim P(x) \in L^2[\mathcal{X}]$, and $y \sim \delta(f_\theta(x))$, $\theta \sim P(\theta) \in L^2[\Theta]$.

Note that $P(\theta \in \Theta)$ and $P(x \in \mathcal{X})$ are independent and $y$ is deterministic given $x, \theta$; see figure 1.

Theorem 1 (Universal Distribution Theorem). Let $q(x, y) \in L^2[\mathcal{X} \times \mathcal{Y}]$, $\mathcal{X} = [0, 1]^n \subset \mathbb{R}^n$, $\mathcal{Y} = [0, 1]^m \subset \mathbb{R}^m$ be a given probability density function. Let $F^k_\Theta$ be the class of 3-layer neural networks with sigmoidal activation function and $k$ neurons in the hidden layer. For any $\epsilon > 0$, $\exists K$ and a functional generative model $(P(x), P(\theta)) \in FGM[F^K_\Theta, \mathcal{X}]$ s.t. $D_{TV}((P(x), P(\theta)), q) < \epsilon$, with $D_{TV}$ being the total variation distance. [Proof in appendix C.]

Figure 4: ERM with common losses is equivalent to maximum likelihood under an FGM that is only stochastic in the output parameters. The particular distribution depends on the loss: a) MSE with a Gaussian, b) L1 with a Laplace, c) cross-entropy with a Gumbel, d) accuracy with a delta plus flat distribution. In practice, the axis for "other parameters" will often refer to thousands of parameters.

FGMs are a superset of some instances of ERM

In appendix B we prove that ERM for four common objectives (MSE, L1 loss, accuracy, and cross-entropy) can be seen as a subcase of maximum likelihood on an FGM where all the stochasticity is restricted to the 'output' parameters. Figure 4 provides a visual intuition on how empirical losses correspond to functional losses in output space.
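As a worked illustration of the claim that a linear function class yields a linear conditional expectation, consider the one-dimensional regressor of section 3.1 with $\theta_i = (\lambda_i, \beta_i)$ drawn independently of $x$, as in Definition 1. The only extra assumption made here is that $\lambda_i$ and $\beta_i$ have finite second moments:

```latex
\begin{align*}
\mathbb{E}[y \mid x] &= \mathbb{E}_{\theta_i \sim P(\theta)}\left[\lambda_i x + \beta_i\right]
  = \mathbb{E}[\lambda_i]\, x + \mathbb{E}[\beta_i], \\
\mathrm{Var}[y \mid x] &= \mathrm{Var}[\lambda_i]\, x^2
  + 2\, \mathrm{Cov}[\lambda_i, \beta_i]\, x + \mathrm{Var}[\beta_i].
\end{align*}
```

The mean is linear in $x$, while the variance depends on $x$ whenever $\mathrm{Var}[\lambda_i] > 0$. Restricting all stochasticity to $\beta_i$ gives a constant variance, which is the homoscedastic noise model associated with mean-squared error.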

4. FUNCTIONAL RISK MINIMIZATION: LEARNING IN FUNCTION SPACE

Now, we look at the supervised learning problem under the FGM assumption.

4.1. MATCHING PROBABILITY DISTRIBUTIONS IN FUNCTION SPACE

We start with our goal to minimize the expected risk, impose the functional generative model assumption, and do basic math manipulations. In the derivation, whenever we use $P(\theta)$ we refer to an unknown probability distribution entirely characterized by the data distribution $P(x, y)$ and function class $f$.

$$
\begin{align}
& \arg\min_{\theta^*} \mathbb{E}_{x,y}\left[L\left(y, f_{\theta^*}(x)\right)\right] = \tag{1} \\
& \arg\min_{\theta^*} \int_x \int_\theta L\left(f_\theta(x), f_{\theta^*}(x)\right) P(\theta) P(x)\, d\theta\, dx = \tag{2} \\
& \arg\min_{\theta^*} -\int_\theta P(\theta) \log\left(e^{-\int_x L(f_\theta(x), f_{\theta^*}(x)) P(x)\, dx}\right) d\theta = \tag{3} \\
& \arg\min_{\theta^*} -\int_\theta P(\theta) \log\left(e^{-\mathbb{E}_x L(f_\theta(x), f_{\theta^*}(x))} \cdot \frac{Z(\theta^*)}{Z(\theta^*)}\right) d\theta = \tag{4} \\
& \arg\min_{\theta^*} H\left(P(\theta), \frac{e^{-\mathbb{E}_x L(f_\theta(x), f_{\theta^*}(x))}}{Z(\theta^*)}\right) - \log\left(Z(\theta^*)\right) = \tag{5} \\
& \arg\min_{\theta^*} H\left(P(\theta), Q_{\theta^*}(\theta)\right) - \log\left(Z(\theta^*)\right), \tag{6}
\end{align}
$$

with $H(P, Q)$ being the cross-entropy operator and $Q_{\theta^*}(\theta) = e^{-\mathbb{E}_x L(f_\theta(x), f_{\theta^*}(x))} / Z(\theta^*)$, $Z(\theta^*) = \int_\theta e^{-\mathbb{E}_x L(f_\theta(x), f_{\theta^*}(x))}\, d\theta$ being a class of probability distributions and their normalizers.

To gain some intuition, we first observe that the second term, $-\log Z(\theta^*) = \log\left(1/Z(\theta^*)\right) = \log\left(1 / \int_\theta e^{-\mathbb{E}_x L(f_\theta(x), f_{\theta^*}(x))}\, d\theta\right)$, is a label-independent regularizer that penalizes $\theta^*$ leading to a small $\int_\theta e^{-\mathbb{E}_x L(f_\theta(x), f_{\theta^*}(x))}\, d\theta$, i.e. a sharp distribution. Now, we can see that the first term is encouraging the matching of two probability distributions in function space:

1. $P(\theta)$: the unknown data-dependent distribution, which does not depend on the loss function $L$. This target distribution is defined entirely by the model class $f$ and the unknown data distribution $P(x, y)$, which we will have to estimate from the training data.
2. $Q_{\theta^*}(\theta)$: a class of probability distributions which depends on the loss function $L$ and the $\theta^*$ used to make predictions, but not on the labels. This approximating distribution makes a parameter $\theta$ more likely the closer the function $f_\theta$ is to $f_{\theta^*}$ according to the problem-specified loss $L$. Intuitively, it is a Gaussian-like distribution centered at $\theta^*$, with a metric that captures the differences in task space. This will be formalized in section 4.2.

This equation also shows that we need not know the exact shape and distribution of $P(\theta)$, which could be very complex without further assumptions. We only need to know its 'projection' to a particular class of probability distributions defined by the task at hand. This also happens in ERM-based learning: we need not know $P(y \mid x)$ in order to estimate a $x \to y$ map.

We would like to optimize equation 6, but we do not have access to samples from $P(\theta)$; we only have $(x, y)$ pairs. However, we can compute the cross-entropy on $P(y \mid x)$ following the functional generative model. Thus, for a given dataset $D_{train} = ((x_i, y_i))_{i=1}^n$ the FRM objective is:

$$
\arg\max_{\theta^*} \sum_{(x_i, y_i)} \log \int_{\theta_i : f_{\theta_i}(x_i) = y_i} e^{-\mathbb{E}_x\left[L\left(f_{\theta_i}(x), f_{\theta^*}(x)\right)\right]}\, d\theta_i. \tag{7}
$$

Note that often we will not have access to the true input distribution $P(x)$ to compute $\mathbb{E}_x\left[L\left(f_{\theta_i}(x), f_{\theta^*}(x)\right)\right]$. In that case, we can also estimate it from samples.
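The identity between the expected risk and the cross-entropy minus the log-normalizer can be checked numerically on a toy problem. The sketch below makes several simplifying assumptions that are not in the text: the parameter space and the input space are finite (so integrals become sums), the function class is $f_\theta(x) = \theta x$, and the loss is the squared error. All names and numbers are hypothetical.

```python
import math

def expected_loss(theta, theta_star, xs, px, f, loss):
    """E_x[L(f_theta(x), f_theta_star(x))] over a finite input distribution."""
    return sum(p * loss(f(theta, x), f(theta_star, x)) for x, p in zip(xs, px))

def risk_and_decomposition(theta_star, thetas, ptheta, xs, px, f, loss):
    """Return (expected risk, H(P, Q_theta_star) - log Z(theta_star))."""
    energies = [expected_loss(t, theta_star, xs, px, f, loss) for t in thetas]
    risk = sum(p * e for p, e in zip(ptheta, energies))
    z = sum(math.exp(-e) for e in energies)          # normalizer Z(theta*)
    q = [math.exp(-e) / z for e in energies]         # approximating distribution Q
    cross_entropy = -sum(p * math.log(qi) for p, qi in zip(ptheta, q))
    return risk, cross_entropy - math.log(z)

# Toy setting: four candidate slopes with probabilities P(theta), four inputs.
thetas = [0.5, 1.0, 1.5, 2.0]
ptheta = [0.1, 0.4, 0.3, 0.2]
xs = [-1.0, 0.0, 1.0, 2.0]
px = [0.25, 0.25, 0.25, 0.25]
f = lambda theta, x: theta * x
squared = lambda a, b: (a - b) ** 2

for theta_star in thetas:
    risk, decomposed = risk_and_decomposition(theta_star, thetas, ptheta, xs, px, f, squared)
    print(f"theta*={theta_star}: risk={risk:.6f}, H - log Z={decomposed:.6f}")
```

Both columns agree for every candidate $\theta^*$, so minimizing either expression selects the same parameter.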

4.2. APPROXIMATING THE FRM OBJECTIVE BY LEVERAGING OVER-PARAMETERIZATION

Equation 7 is an integral in high dimensions under a non-linear constraint. In general, this is well known to be computationally challenging. Fortunately, for this particular class of systems, we can rely on over-parameterization to propose a reasonable approximation.

First, as a sanity check, we observe that all constraints $f_{\theta_i}(x_i) = y_i$ are independent and that they all have a viable solution, as we are only trying to fit each single data point $(x_i, y_i)$ with the entire parameter set $\theta_i$. For instance, even a constant model $f(x) = c$ fits the data with $c_i = y_i\ \forall i$. In other words, the system $(\theta^*, \theta_1, \dots, \theta_n)$ is always over-parameterized. Moreover, for reasonably parameterized models it is often extremely over-parameterized: even small models of $10^4$ parameters (compared to modern models of more than $10^{10}$ parameters) may be under-parameterized w.r.t. the entire dataset, but extremely over-parameterized w.r.t. fitting a single data point. Therefore, similar to the Neural Tangent Kernel literature (Jacot et al., 2018) for extremely wide neural networks, we can assume that a very small perturbation will be enough to fit each datapoint.

Now, we assume that we only need to analyze small perturbations $\Delta_i$ around a parameter $\theta^*$ for $|\Delta_i| \ll 1$. We can therefore take the Laplace approximation of the probability distribution we want to fit and assume it is a Gaussian with mean at $\theta^*$:

$$
\mathcal{N}\left(\theta^*, H^{-1}_{f,L,\theta^*}\right), \qquad \left(H_{f,L,\theta^*}\right)_{j,k} := \frac{\partial^2\, \mathbb{E}_x\left[L\left(f_{\theta+\Delta}(x), f_\theta(x)\right)\right]}{\partial \Delta_j\, \partial \Delta_k},
$$

where the derivative is evaluated at $\theta = \theta^*$ and $\Delta = 0$. Similarly, we can take the first-order Taylor approximation of the function, $f_{\theta+\Delta}(x_i) \approx f_\theta(x_i) + J_\theta f_\theta(x_i)^T \cdot \Delta$, assuming it is linear. Omitting the normalizer term, this leads to:

$$
\arg\max_{\theta^*} \sum_{(x_i, y_i)} \log \int_{\Delta_i : f_\theta(x_i) + J_\theta f_\theta(x_i)^T \cdot \Delta_i = y_i} \frac{e^{-\Delta_i^T \cdot H_{f,L,\theta^*} \Delta_i}}{Z(\theta^*)}\, d\theta_i. \tag{8}
$$
Under these conditions, computing the likelihood of $f_{\theta+\Delta_i}$ fitting $x_i$ involves integrating a Gaussian distribution over either a subspace (for regression) or a half-space (for binary classification).

Regression

We first note that the integral of the Gaussian under a constraint can be seen as the pdf of $y_i \sim J_\theta f_\theta(x_i) \cdot \Delta + f_\theta(x_i)$, $\Delta \sim \mathcal{N}(0, H^{-1}_{f,L})$. Because it is a fixed linear transformation of a Gaussian distribution, it can also be expressed as a Gaussian. In particular, using the notation $J_i := J_\theta f_\theta(x_i)$, we have $p(y_i) \sim \mathcal{N}\left(f_\theta(x_i), J_i^T H^{-1}_{f,L} J_i\right)$. Computing its log-likelihood we obtain the following training objective, where both $J_i$ and $H_{f,L}$ depend on $\theta$:

$$
\arg\min_\theta \sum_{i=1}^n \left(y_i - f_\theta(x_i)\right)^T \left(J_i^T H^{-1}_{f,L} J_i\right)^{-1} \left(y_i - f_\theta(x_i)\right) + \sum_{i=1}^n \log \left|J_i^T H^{-1}_{f,L} J_i\right| \tag{9}
$$

Classification

For binary classification the solution is similar, except that we integrate over a half-space instead of a hyper-plane. Thus, we take the Gaussian ccdf (complementary cumulative distribution function) instead of the Gaussian pdf. Therefore, to maximize the log-probability of a function fitting a point, we minimize the Gaussian logcdf of the signed distance function to the decision boundary:

$$
\min_\theta \sum_{i=1}^n \mathrm{logcdf}(\Delta_i), \qquad \Delta_i := \mathrm{sign}\left(\sigma(f_\theta(x_i))_{y_i} - \tfrac{1}{2}\right) \min_{\theta_i : \sigma(f_{\theta_i}(x_i))_{y_i} = \frac{1}{2}} |\theta_i - \theta|_{\Sigma_{f,L}},
$$

where $\Delta_i$ is the signed distance to the decision boundary. Note that in classification the best perturbation is not zero, but a very negative (i.e. opposite to the gradient) value, since this implies that the parameter $\theta$ is well within the correct classification region. This is also similar to regular ERM in binary cross-entropy, where we maximize the sigmoid, which has a shape very similar to that of the Gaussian cdf.

For multi-class classification the integral is over an intersection of $C - 1$ half-spaces (comparing each class with the correct class $y_i$). The efficient integration in that case is still an active area of research (Gessner et al., 2020).
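For the regression case, the step from the constrained Gaussian integral to equation 9 can be spelled out as follows. This is a sketch under the linearization of this section, writing $m$ for the output dimension and $r_i$, $S_i$ as shorthand (symbols introduced here for illustration):

```latex
\begin{align*}
y_i &= f_\theta(x_i) + J_i^T \Delta, \qquad \Delta \sim \mathcal{N}\left(0, H_{f,L}^{-1}\right), \\
\mathbb{E}[y_i] &= f_\theta(x_i), \qquad
\mathrm{Cov}[y_i] = J_i^T H_{f,L}^{-1} J_i =: S_i, \\
-\log p(y_i) &= \tfrac{1}{2}\, r_i^T S_i^{-1} r_i + \tfrac{1}{2} \log |S_i| + \tfrac{m}{2} \log 2\pi,
\qquad r_i := y_i - f_\theta(x_i).
\end{align*}
```

Summing over $i$, multiplying by 2, and dropping the constant term recovers the two sums in equation 9. The first sum is a residual re-weighted by how easily the model can move its prediction at $x_i$, and the second discourages making every prediction arbitrarily easy to move.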
For the multi-class case, two potential alternatives may be practical: turning the training of an n-way classification into n binary classifications, and linearizing the softmax of all incorrect classes jointly instead of linearizing each one independently.

4.3. FRM MAY DO EXPLICITLY WHAT OVER-PARAMETERIZED ERM DOES IMPLICITLY

FRM implicitly assigns to every datapoint $(x_i, y_i)$ its own latent model $f_{\theta_i}$ which fits it: $f_{\theta_i}(x_i) = y_i$. In this way, we can turn a model $f_\theta$ into an over-parameterized hyper-model. Although $\theta_i$ is unobserved in FGMs, the previous Taylor version of FRM becomes equivalent to this optimization:

$$
\min_{\theta_1, \dots, \theta_n : f_{\theta_i}(x_i) = y_i} \sum_{i,j} |\theta_i - \theta_j|^2_{M_{f,L,\theta}} = \min_\theta \sum_i \min_{\theta_i : f_{\theta_i}(x_i) = y_i} |\theta_i - \theta|^2_{M_{f,L,\theta}},
$$

where explicit $\theta_i$ are sought that are as close as possible according to the metric $M$. Whereas ERM finds the function that best fits the data among a class of simple functions, FRM finds the simplest hyper-model $\{\theta_1, \dots, \theta_n\}$ that fits the data, in line with the principle of Occam's Razor. Simplicity is measured as the distance of the parameters to a central parameter, given a metric that captures the relationship between the function class $f_\theta$ and the loss $L$. This encourages each independent function to be close to the central one, and thus all functions to be close to each other, as shown in figure 6.

This is related to the line of research presented by Bartlett et al. (2021), which conjectures that ERM under gradient descent may implicitly find a function with two components, $f_{stable} + f_{spiky}$, such that the spiky component has negligible norm but allows overfitting. In this regard, FRM can be seen as explicitly searching for the smallest necessary perturbation for each point.
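The inner minimization, finding the closest $\theta_i$ that fits a single point, has a closed form when the model is linear in its parameters. The sketch below assumes such a model and, for simplicity, a diagonal metric $M$ (the text allows a general metric); the function names and data are hypothetical.

```python
def min_perturbation(jac, residual, metric_diag):
    """Smallest delta under the metric diag(metric_diag) with jac . delta == residual."""
    minv_j = [j / m for j, m in zip(jac, metric_diag)]      # M^{-1} J
    denom = sum(j * mj for j, mj in zip(jac, minv_j))       # J^T M^{-1} J
    return [mj * residual / denom for mj in minv_j]

def hyper_model(theta, points, features, metric_diag):
    """Per-point parameters theta_i = theta + delta_i for a model linear in theta."""
    thetas = []
    for x, y in points:
        jac = features(x)                                    # gradient of f_theta(x) w.r.t. theta
        residual = y - sum(t * j for t, j in zip(theta, jac))
        delta = min_perturbation(jac, residual, metric_diag)
        thetas.append([t + d for t, d in zip(theta, delta)])
    return thetas

# Central line y = 2x + 1 and three points that it does not fit exactly.
theta = [2.0, 1.0]
points = [(1.0, 3.5), (2.0, 4.0), (3.0, 8.0)]
features = lambda x: [x, 1.0]

per_point = hyper_model(theta, points, features, metric_diag=[1.0, 1.0])
for (x, y), (lam_i, beta_i) in zip(points, per_point):
    print(f"x={x}, y={y}: lam_i={lam_i:.3f}, beta_i={beta_i:.3f}, fit={lam_i * x + beta_i:.3f}")
```

Each $\theta_i$ reproduces its own label exactly while staying as close as possible to the central $\theta$; the outer minimization over $\theta$ then makes these perturbations collectively small.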

5. EXPERIMENTS

To scale to neural networks, we leveraged the Taylor approximation in section 4.2. However, that requires inverting a Hessian, which is usually too big to even instantiate in memory. We bypassed this problem by 1) relying on iterative solvers to avoid the cubic cost and 2) materializing only Hessian-vector products. To do so, we use JAX (Bradbury et al., 2018) and the jaxopt package (Blondel et al., 2021), which implements implicit gradients.

Figure 7: Ratio of train and test error between ERM and FRM as a function of the ratio between noise in the scale vs. offset components in 1-D and 10-D linear regression. As expected, we can see that ERM always has lower training loss as well as slightly lower test loss (12% lower) when its assumption (Gaussian noise only on the offset) is perfectly satisfied. When noise is heteroscedastic, ERM has up to 40% higher test error. In 10 dimensions, the advantage of FRM is even starker: ERM can have 4 times more test error than FRM, despite having lower training error.
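The idea of solving with the Hessian using only Hessian-vector products can be illustrated without any autodiff library. The sketch below is a plain-Python stand-in, not the JAX/jaxopt code used for the experiments: it assumes a Hessian of the form $H = \frac{1}{n}\sum_i J_i J_i^T$ (as in the linear model of section 5.1) plus a small damping term added here for numerical safety, and uses a textbook conjugate gradient solver. All names are hypothetical.

```python
def hvp(vec, jacobians, damping=1e-6):
    """Product H v for H = mean_i J_i J_i^T + damping * I, without forming H."""
    out = [damping * v for v in vec]
    for jac in jacobians:
        coeff = sum(j * v for j, v in zip(jac, vec)) / len(jacobians)
        for k, j in enumerate(jac):
            out[k] += coeff * j
    return out

def conjugate_gradient(matvec, b, iters=50, tol=1e-12):
    """Solve A x = b for symmetric positive-definite A given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the initial x = 0
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        if rs < tol:
            break
        ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Per-point Jacobians of a 1-D linear model, J_i = [x_i, 1].
jacobians = [[x, 1.0] for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
j_query = [3.0, 1.0]                                   # Jacobian at a query input x = 3
v = conjugate_gradient(lambda u: hvp(u, jacobians), j_query)
scale = sum(j * vi for j, vi in zip(j_query, v))       # J^T H^{-1} J for the query point
print(v, scale)
```

The solver never builds or inverts $H$; each iteration costs one Hessian-vector product, which is what makes the approach viable when the parameter count is large.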

5.1. LINEAR LEAST SQUARES

To better understand the trade-offs between FRM and ERM we analyze the simple case of linear regression under mean-squared error risk. We consider a d-dimensional input and a one-dimensional output. The classic ERM solution minimizes the risk on the training data: $\min_{\lambda,\beta} \sum_{(x_i, y_i)} (\lambda \cdot x_i + \beta - y_i)^2$. This is equal to doing maximum likelihood on a fixed Gaussian noise on $\beta$. Thus, we expect ERM to do well in this situation, but not necessarily otherwise.

For linear regression with squared loss, the Taylor approximations in section 4.2 are exact. Furthermore, both the Hessian and the gradients are independent of the parameters, which further simplifies the objective function to just a specific re-weighting of the per-point risks:

$$
\min_{\lambda,\beta} \sum_{(x_i, y_i)} \frac{(\lambda \cdot x_i + \beta - y_i)^2}{[x_i, 1]\, H^{-1}\, [x_i, 1]^T}, \qquad \text{with } H = \mathbb{E}_x\left[[x, 1]^T [x, 1]\right].
$$

Figure 7 shows that indeed ERM does slightly better with Gaussian noise in the bias, but FRM does much better when the noise is entirely in the slope. We also observe that FRM is more than 4 times better in higher dimensions.
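For a one-dimensional input, the re-weighted objective above is a weighted least-squares problem with a closed-form solution. The following sketch assumes that $H$ is estimated from the training inputs (the text defines it as an expectation over $x$); the helper names and data are hypothetical.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a z = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def weighted_least_squares(xs, ys, weights):
    """Minimize sum_i w_i (lam * x_i + beta - y_i)^2 over (lam, beta)."""
    sxx = sum(w * x * x for w, x in zip(weights, xs))
    sx = sum(w * x for w, x in zip(weights, xs))
    s1 = sum(weights)
    sxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    sy = sum(w * y for w, y in zip(weights, ys))
    return solve_2x2([[sxx, sx], [sx, s1]], [sxy, sy])

def frm_weights(xs):
    """w_i = 1 / ([x_i, 1] H^{-1} [x_i, 1]^T) with H = mean_i [x_i, 1]^T [x_i, 1]."""
    n = len(xs)
    h = [[sum(x * x for x in xs) / n, sum(xs) / n], [sum(xs) / n, 1.0]]
    weights = []
    for x in xs:
        v = solve_2x2(h, [x, 1.0])             # H^{-1} [x, 1]^T
        weights.append(1.0 / (x * v[0] + v[1]))
    return weights

def fit_erm(xs, ys):
    return weighted_least_squares(xs, ys, [1.0] * len(xs))

def fit_frm(xs, ys):
    return weighted_least_squares(xs, ys, frm_weights(xs))

xs = [0.5, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]               # noise-free line for a sanity check
print("ERM:", fit_erm(xs, ys))
print("FRM:", fit_frm(xs, ys))
print("FRM weights:", [round(w, 3) for w in frm_weights(xs)])
```

On noise-free data both objectives recover the same line. With this estimate of $H$, the weights shrink for inputs far from the mean input, so residuals at such points are discounted, which is where slope noise contributes most.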

5.2. VALUE FUNCTION ESTIMATION

We demonstrate here that the proposed approach can be broadly applied on an illustrative offline value estimation task using the mountain car domain (Sutton & Barto, 2018). We consider the problem of learning a linear value function using a 15 × 15 grid of radial basis functions (RBFs), using the 1-step temporal difference (TD) error (Sutton, 1988) as the training loss function and using sampled transitions gathered by a near-optimal policy. Both approaches were optimized with stochastic gradient descent with a batch size of 256 and a constant learning rate, selected for each approach by a grid search over hyper-parameters. Performance is then evaluated using the root mean squared error (RMSE) between predictions and the true values on unseen samples.

We consider two different arrangements of RBFs, a uniform layout and one that is denser towards the center of the environment. Note that although the true value function has a discontinuity spiraling out from the center, which might benefit from finer resolution, the more poorly conditioned nature of this non-uniform arrangement of features makes the problem harder, as can be seen in figure 8. We see that FRM is competitive in the easier of the two cases while outperforming ERM by over 20% in the harder one. We hypothesize that TD loss is commonly subject to complex noise that can severely hinder ERM when its features are poorly aligned. Furthermore, due to the use of bootstrapping ($L(s, r, s') := (f_\theta(s) - r - \gamma f_\theta(s'))^2$), the temporal difference error is inherently functional through the term $f_\theta(s')$ affecting the label.

5.3. FGM-BASED VAE FINDS BETTER REPRESENTATIONS WITHIN STRUCTURED VARIATIONS

To better understand when FRM works better than ERM, we build a Variational AutoEncoder (VAE) on top of MNIST (LeCun et al., 1998) and combinations of two popular variations: colored MNIST (Arjovsky et al., 2019) and translated MNIST (Jaderberg et al., 2015). We build a vanilla VAE with an MLP encoder and a CNN decoder. Then, we evaluate the quality of the representation for classification, comparing the vanilla VAE with an FGM-based decoder where noise is modeled in function space. For FGM, we train a small MLP on top of the latent representation, with a stop-gradient, and measure accuracy depending on the size of the latent. We see that in MNIST, where natural variations in orientation, translation, and color have been unnaturally removed, some gains exist but are small. In the datasets containing variations in color or translation, the FRM gains are substantial. This is because noise in CNN weights can easily explain these structured variations, as shown in figure 2. Similarly, papers such as Deep Image Prior (Ulyanov et al., 2018) have argued that neural networks are good models for real-world variability, making FRM particularly appealing for modeling real-world data. Results are shown in figure 9.

6. CONCLUSION

The main limitation of FRM in its current form is its compute cost. Thanks to the approximations proposed in sections 4.2 and 5 we can run FRM on a ResNet-50 using a single GPU, but with a prohibitive iteration cost. However, in the long term, FRM could be orders of magnitude more efficient than ERM-based approaches. As explained in section 4.3, under-parameterized FRM may behave similarly to over-parameterized ERM by making models have n times more parameters $\theta_1, \dots, \theta_n$. There, each $\theta_i$ is instantiated on the fly for loss computation and thus doesn't need to be in memory. This could provide orders of magnitude of benefit for modern datasets, where often $n > 10^6$.

In recent years, there has been a clear tendency towards building large models capable of performing many tasks which were previously modeled individually. FGMs propose the natural step to model the diversity in these datasets in function space rather than output space, allowing for richer and more meaningful noise models. Despite noise being pervasive across real-world data, modern deep learning approaches still largely use simple unstructured noise models. As we keep moving towards larger, more varied datasets, properly modeling the internal data diversity will become crucial. We believe FRM provides a first step towards an effective solution.

A FUNCTIONAL NOISE IN A CNN

To show the value of the Taylor approximation, we create a dataset by sampling different parameter assignments on a 4-layer CNN architecture. The CNN takes in a CIFAR-10 image and outputs a real number. We provide only 8 labels to each method, allowing empirical risk minimization to easily memorize the dataset. Despite FRM obtaining a substantially higher training loss than ERM (.052 vs .000), we observe that FRM obtains a significantly lower test error (.085 vs .125).

We also test the ability of FRM to modify its training depending on the loss function. Although this is obviously the case for ERM, in approximate FRM the loss function enters only in an indirect way, affecting the Hessian in equation 8. We modify the objective by creating two different losses, which assign zero loss to labels that are either positive or negative, respectively. Table 1 shows that indeed FRM performs better when trained and tested on the same loss (0.085 vs 0.128).

B PROOFS OF EMPIRICAL LOSSES

[...] $(f(x_i) - y_i)^2$, i.e. the mean-squared error loss.

Lemma 1. For any arbitrary function class $f_{\theta,\beta}(x)$ expressible as $f_{\theta,\beta}(x) = f_\theta(x) + \beta$, there exists a functional loss restricted to functional adaptations $\theta_i = \theta$ that only change $\beta \to \beta_i$ which is equivalent to the mean-squared error loss.

Proof. Since we can only change $\beta$, there is a single solution to the per-point constraint: $f_{\theta,\beta_i}(x_i) = f_\theta(x_i) + \beta_i = y_i \Rightarrow \beta_i = y_i - f_\theta(x_i)$. We can now model the probability distribution over functions $F(\theta, \beta_i \mid \theta, \beta, L_{MSE})$ as a Gaussian centered at $(\theta, \beta)$. Since $\theta$ doesn't change, this will just be $\mathcal{N}(\beta_i - \beta)$. Maximizing the mean of the log-probabilities, $\frac{1}{n} \sum_i \log \mathcal{N}(\beta_i - \beta)$, is then equivalent (up to sign, a positive scale, and an additive constant) to minimizing

$$
\frac{1}{n} \sum_i (\beta_i - \beta)^2 = \frac{1}{n} \sum_i \left(y_i - f_\theta(x_i) - \beta\right)^2 = \frac{1}{n} \sum_i \left(y_i - f_{\theta,\beta}(x_i)\right)^2 = L_{MSE}.
$$

Of note, the Gaussian model of the functional distribution satisfies $F(\theta, \beta_i \mid \theta, \beta, L_{MSE}) = \mathcal{N}((\theta, \beta_i) - (\theta, \beta)) \propto e^{-|\beta - \beta_i|^2} = e^{-\mathbb{E}_x L_{MSE}(f_{\theta,\beta}, f_{\theta,\beta_i})}$. This is because, for all $x$, $L_{MSE}(f_{\theta,\beta}(x), f_{\theta,\beta'}(x)) = |f_{\theta,\beta}(x) - f_{\theta,\beta'}(x)|^2 = |\beta - \beta'|^2$.

Finally, we note that the entire derivation can be equivalently followed for the L1 loss by swapping $|\cdot|^2$ for $|\cdot|$ and the Gaussian distribution for the Laplace distribution.

B.2 CLASSIFICATION ERROR AS A FUNCTIONAL LOSS

Let us now look at multi-class classification and let our dataset be $D_{train} = \{(x_i, y_i)\}_{i=1}^n$, $y_i \in \{1, \dots, C\}$. Our function class will output in an unconstrained logit space $\mathbb{R}^C$ and we define $L_{cls} = \frac{1}{n}\sum_{i=1}^n 1[y_i = \arg\max_c (f_{\theta,\beta}(x_i))_c]$, i.e. the classification error. Note that, as written, the indicator counts correct predictions, so this quantity is maximized rather than minimized. As in previous sections, abusing notation we will refer to $1[y_i = \arg\max_c (f_{\theta,\beta}(x_i))_c]$ as $1[y_i = f_{\theta,\beta}(x_i)]$.

Lemma 2. For any arbitrary function class $f_{\theta,\beta}(x)$ expressible as $f_{\theta,\beta}(x) = f_\theta(x) + \beta$, $\beta \in \mathbb{R}^C$, constrained on $f_\theta(x)$ being finite, there exists a functional loss restricted to functional adaptations $\theta_i = \theta$ that only change $\beta \to \beta_i$ which is equivalent to the classification error.

Proof. We will show that a solution is given by $\mathcal{F}(\theta, \beta_i \mid \theta, \beta, L_{cls}) = p \cdot \delta(\beta_i - \beta) + (1 - p) \lim_{\sigma \to \infty} N(0, \sigma)(\beta)$, with $p = \frac{e-1}{C+e-1} \in (0, 1)$. In other words, a specific positive (note the open interval) combination of an infinitely sharp distribution (Dirac's delta) with an infinitely flat distribution.

Given a fixed $p, \theta, \beta$, the probability of $y_i = \arg\max_c (f_{\theta_i,\beta_i}(x_i))_c$ will be equal to $p \cdot 1[y_i = \arg\max_c (f_{\theta,\beta}(x_i))_c] + \frac{1-p}{C}$. This comes directly from the definition of the functional probability distribution: with probability $p$, we have $(\theta_i, \beta_i) = (\theta, \beta)$ and thus the result depends solely on $(\theta, \beta)$; with probability $(1 - p)$ the logits are perturbed by an infinitely strong noise and thus the arg max will just follow a uniform distribution over the classes, i.e. each class has probability $\frac{1}{C}$. Now, the average log-likelihood of the functional loss will be:

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n \log\left(p \cdot 1[y_i = f_{\theta,\beta}(x_i)] + \frac{1-p}{C}\right)
&= \log\frac{1-p}{C} + \frac{1}{n}\sum_{i=1}^n \log\frac{p \cdot 1[y_i = f_{\theta,\beta}(x_i)] + (1-p)/C}{(1-p)/C} \\
&= \log\frac{1-p}{C} + \log\frac{p + (1-p)/C}{(1-p)/C} \cdot \frac{1}{n}\sum_{i=1}^n 1[y_i = f_{\theta,\beta}(x_i)] \\
&= \log\frac{1-p}{C} + \log\left(1 + \frac{pC}{1-p}\right) L_{cls} \\
&= -\log(C + e - 1) + L_{cls},
\end{aligned}
$$

where in the second step we observe that the log term within the sum is zero when $y_i \neq f_{\theta,\beta}(x_i)$ and, in the last step, we have set $p = \frac{e-1}{C+e-1}$, which by construction is in $(0, 1)$, so that $\frac{1-p}{C} = \frac{1}{C+e-1}$ and $1 + \frac{pC}{1-p} = e$.

We can now easily see that this is equivalent to $L_{cls}$ up to a constant additive term, which will not affect any optimization.
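The closed form above can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function names, the number of classes and the simulated list of correct/incorrect predictions are all illustrative assumptions. It evaluates the average log-likelihood of the mixture directly and compares it with -log(C + e - 1) + L_cls.

```python
import math
import random


def mean_functional_log_likelihood(correct, num_classes):
    """Average of log(p * 1[correct] + (1 - p) / C) with p = (e - 1) / (C + e - 1)."""
    p = (math.e - 1) / (num_classes + math.e - 1)
    uniform_mass = (1 - p) / num_classes  # probability of any class under flat noise
    return sum(math.log(p * c + uniform_mass) for c in correct) / len(correct)


def closed_form(correct, num_classes):
    """-log(C + e - 1) + L_cls, where L_cls is the mean of the 0/1 indicator."""
    l_cls = sum(correct) / len(correct)
    return -math.log(num_classes + math.e - 1) + l_cls


# Hypothetical data: 1 means the arg max matched the label, 0 means it did not.
random.seed(0)
C = 10
correct = [1 if random.random() < 0.7 else 0 for _ in range(1000)]

lhs = mean_functional_log_likelihood(correct, C)
rhs = closed_form(correct, C)
print(abs(lhs - rhs))  # difference between the direct sum and the closed form
```

The two quantities agree up to floating-point error for any number of classes, because the chosen p makes the per-point log-likelihood gap between a correct and an incorrect prediction exactly 1.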

B.3 CROSS-ENTROPY LOSS AS A FUNCTIONAL LOSS

Continuing with multi-class classification, let our dataset be $D_{train} = \{(x_i, y_i)\}_{i=1}^n$, $y_i \in \{1, \dots, C\}$. Our function class will output in an unconstrained logit space $\mathbb{R}^C$ and we define $L_{CE} = \frac{1}{n}\sum_{i=1}^n \log \sigma(f_{\theta,\beta}(x_i))_{y_i}$, i.e. the cross-entropy loss (written here as an average log-likelihood, so it is maximized). Here, $\sigma(\cdot)_c$ corresponds to taking the $c$-th component of the softmax of a given logit vector, i.e. the predicted probability of class $c$.

Lemma 3. For any arbitrary function class $f_{\theta,\beta}(x)$ expressible as $f_{\theta,\beta}(x) = f_\theta(x) + \beta$, $\beta \in \mathbb{R}^C$, there exists a functional loss restricted to functional adaptations $\theta_i = \theta$ that only change $\beta \to \beta_i$ which is equivalent to the cross-entropy loss.

Proof. As shown in (Jang et al., 2016; Maddison et al., 2016), if we have logits $\gamma_c = f_\theta(x_i)_c + \beta_c$ we can sample from the categorical distribution with probabilities $\sigma(\gamma)$ by taking $c = \arg\max_j (\gamma_j + g_j)$, where each $g_j$ follows an independent Gumbel distribution, i.e. $g_j = -\log(-\log u_j)$, $u_j \sim U(0, 1)$. This gives us a simple functional distribution on which to perform maximum likelihood: $\beta_i \sim \beta + G$, where $G$ consists of $C$ independent Gumbel noise variables. This works because $\beta$ lives in logit space, so adding noise to $\beta$ is equivalent to adding noise to the logits. Finally, since the cross-entropy loss is the maximum likelihood objective under the probability distribution given by the logits, and we have shown a functional distribution that induces the same distribution, performing maximum likelihood on that functional distribution is equivalent to minimizing the cross-entropy loss.
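The Gumbel-max sampling step used in this proof can be illustrated with a short simulation. This is a minimal sketch using only the Python standard library; the logits, sample count and function names are illustrative assumptions. It perturbs fixed logits with independent Gumbel noise, takes the arg max, and compares the empirical class frequencies with the softmax probabilities.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax of a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng):
    """One standard Gumbel sample: g = -log(-log u), u ~ U(0, 1)."""
    u = max(rng.random(), 1e-300)  # guard against log(0)
    return -math.log(-math.log(u))


def gumbel_max_sample(logits, rng):
    """Index of the largest logit after adding independent Gumbel noise."""
    noisy = [g + sample_gumbel(rng) for g in logits]
    return max(range(len(noisy)), key=noisy.__getitem__)


def empirical_frequencies(logits, num_samples, seed=0):
    """Class frequencies of repeated Gumbel-max samples."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(num_samples):
        counts[gumbel_max_sample(logits, rng)] += 1
    return [c / num_samples for c in counts]


logits = [1.0, 0.0, -1.0]  # hypothetical gamma = f_theta(x_i) + beta
freqs = empirical_frequencies(logits, 50_000)
probs = softmax(logits)
print(freqs, probs)
```

With enough samples the empirical frequencies approach the softmax probabilities, which is the property that lets noise on $\beta$ reproduce the categorical likelihood behind the cross-entropy loss.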

C UNIVERSAL DISTRIBUTION THEOREM

Definition 2. Given a function class $\mathcal{F}$ with parameterisation $\Theta$, we define a functional generative model $(P(x), P(\theta)) \in FGM[\mathcal{F}_\Theta, \mathcal{X}]$ as a probability density function $p(x, y) \in L^2[\mathcal{X} \times \mathcal{Y}]$ with $x \sim P(x) \in L^2[\mathcal{X}]$, and $y = f_\theta(x)$, $\theta \sim P(\theta) \in L^2[\Theta]$. Note that, in particular, $P(\theta \in \Theta)$ and $P(x \in \mathcal{X})$ are independent and $y$ is deterministic given $x, \theta$, as shown in figure 1.

Theorem 2 (Universal Distribution Theorem). Let $q(x, y) \in L^2[\mathcal{X} \times \mathcal{Y}]$, $\mathcal{X} = [0, 1]^n \subset \mathbb{R}^n$, $\mathcal{Y} = [0, 1]^m \subset \mathbb{R}^m$, be a given probability density function. Let $\mathcal{F}^k_\Theta$ be the class of 3-layer neural networks with sigmoidal activation function and $k$ neurons in the hidden layer. For any $\epsilon > 0$, $\exists K$ and a functional generative model $(P(x), P(\theta)) \in FGM[\mathcal{F}^K_\Theta, \mathcal{X}]$ s.t. $D_{TV}((P(x), P(\theta)), q) < \epsilon$, with $D_{TV}$ being the total variation distance.

Proof. For the first layer we use deterministic weights with arbitrarily large slope to implement the functions $1[x_i \geq c_j]$ for all coordinates $1 \leq i \leq n$ and $c_j \in \{-1, -1 + \epsilon, \dots, 1 - \epsilon, 1\}$. For the second layer, we again use deterministic weights to implement the functions $1[x \in [a_1, a_1 + \epsilon) \times \dots \times [a_n, a_n + \epsilon)]$, which determine whether a given input is within a hyper-cube of side $\epsilon$. Exactly one of those second-layer nodes will be active for any given input. From the node corresponding to $[a_1, a_1 + \epsilon) \times \dots \times [a_n, a_n + \epsilon)$ to the output nodes there are $m$ weights $\theta$; we assign them the distribution $\theta_{1:m} \sim P(y \mid x = (a_1, \dots, a_n))$. Because $P(y \mid x)$ is continuous, $P(y \mid x = (a_1, \dots, a_n))$ will be arbitrarily close to $P(y \mid x)$ for any $x$ in the hyper-cube $[a_1, a_1 + \epsilon) \times \dots \times [a_n, a_n + \epsilon)$ for a sufficiently small $\epsilon$.

We note that this universality also holds for 2-layer neural networks (also a universal function class). However, the proof for that case is more cumbersome and less insightful for our purposes.

D FURTHER UNDERSTANDING THE DIFFERENCE BETWEEN ERM AND FRM

The ERM assumption: by assuming that the training objective is equal to the test loss $L$, ERM can be suboptimal for certain $P(\theta)$, as in the house example in section 3. As shown in appendix B, for many loss functions $L$, including most of the common ones, ERM is equivalent to assuming the functional generative model and then doing maximum likelihood on $P(\theta)$ under the assumption that it has a form parameterized by $\theta$ whose uncertainty lies only in the output offset parameters. In other words, the assumption implicit in performing ERM is often strictly more restrictive than that of FGMs.

For instance, consider predicting the price of different houses as a function of their size, with MSE as the loss. Doing empirical risk minimization with the MSE would be equivalent to doing maximum likelihood on the following price model: $y_i \sim N(f(x_i), \sigma^2)$. However, we would expect noise to be heteroskedastic, with higher variations for higher prices. Thus, even if we are evaluated on MSE on the test data, it may not be advisable to use it as our training criterion.

Similarly, consider a child learning a concept from examples in a textbook rather than from standardized images in a dataset. Images may receive different illumination from the sunlight, or be in different positions than we expect. These factors will produce massive changes in pixel space, but in very structured ways (figure 2). However, humans can still easily grasp the idea because the 'conceptual' noise is small.

How can we have more meaningful noise models? By construction, we will often believe that the function class $f_\theta$ is a good characterization of the relationship between $x$ and $y$. It is thus a natural assumption to define a noise model by leveraging the function class itself.
More concretely, we can think of a generative model of the data as first sampling the input $x_i$, then sampling a function $f_i \sim \mathcal{F}(L, \theta)$ from some parameterized distribution over functions, which will depend on both the problem-specified loss function $L$ and the function class $f_\theta$. Once the function and the input have been sampled, the output is automatically determined: $y_i = f_i(x_i)$; see the right of figure 1. For example, in our house-price prediction, if we are using a linear model $f(x) = \lambda \cdot x + \beta$, then it makes sense to think about our data as coming from first sampling $x_i \sim p(x)$ and $(\lambda_i, \beta_i) \sim \mathcal{F}(L, (\lambda, \beta))$, then computing $y_i = \lambda_i \cdot x_i + \beta_i$, as shown on the right of figure 1. For instance, $\beta_i$ can model different commissions or taxes, and $\lambda_i$ can model the price per square meter varying across neighborhoods. Even if we care about making accurate predictions in dollar-space, assuming our uncertainty is only in the offset term $\beta_i$ may be too restrictive.
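The generative process described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the Gaussian form of the distribution over $(\lambda_i, \beta_i)$, the uniform $p(x)$, every numeric value and all function names are hypothetical choices, not taken from the paper. It samples one function per data point and compares the spread of residuals around the mean line for small and large houses.

```python
import math
import random


def sample_linear_fgm(n, lam, beta, sigma_lam, sigma_beta, seed=0):
    """Sample (x_i, y_i) with a fresh (lam_i, beta_i) per point and y_i = lam_i * x_i + beta_i."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x_i = rng.uniform(50.0, 250.0)        # house size; assumed range
        lam_i = rng.gauss(lam, sigma_lam)     # per-point slope (price per unit size)
        beta_i = rng.gauss(beta, sigma_beta)  # per-point offset (fees, taxes)
        data.append((x_i, lam_i * x_i + beta_i))
    return data


def residual_rms(data, lam, beta, x_min, x_max):
    """Root-mean-square residual around the mean line for points with x_min <= x < x_max."""
    res = [y - (lam * x + beta) for x, y in data if x_min <= x < x_max]
    return math.sqrt(sum(r * r for r in res) / len(res))


LAM, BETA = 3.0, 20.0
slope_noise = sample_linear_fgm(5000, LAM, BETA, sigma_lam=0.2, sigma_beta=5.0)
offset_only = sample_linear_fgm(5000, LAM, BETA, sigma_lam=0.0, sigma_beta=5.0)

small_s = residual_rms(slope_noise, LAM, BETA, 0.0, 150.0)
large_s = residual_rms(slope_noise, LAM, BETA, 150.0, float("inf"))
small_o = residual_rms(offset_only, LAM, BETA, 0.0, 150.0)
large_o = residual_rms(offset_only, LAM, BETA, 150.0, float("inf"))
print(small_s, large_s, small_o, large_o)
```

When the slope is noisy, the residual spread grows with house size, which is the heteroskedastic pattern that an offset-only (MSE-style) noise model cannot express; with offset-only noise the spread is the same for small and large houses.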

D.1 ERM VS FRM FOR THE LINEAR CASE

Let us now take a deeper look at our linear regression example. We have a dataset $D_{train} = \{(x_i, y_i)\}$, depicted in the top-right of figure 3a, with an arbitrary color per point. For every point, there is a subspace of models $(\lambda_i, \beta_i)$ s.t. $\lambda_i x_i + \beta_i = y_i$. Since we only have two parameters, we can also look at function space in 2-D, and plot the corresponding subspace for each point, in the bottom-left of figure 3a. We observe that every point gives us a line in function space, which we plot with the corresponding color.

Our goal is to produce a probability distribution $P(\lambda, \beta)$ such that the sum of the log-densities of the lines $\{(\lambda_i, \beta_i) : \lambda_i x_i + \beta_i = y_i\}$ is maximal. Intuitively, this means that each line should pass through a high-density area of the probability distribution, but it does not mean that the line should be covered by the high-density area (which is not possible, since the lines are unbounded). This can be seen in figure 3b, where all lines pass near the center of the distribution generating the data (marked in green).

We can further see that ERM with the MSE loss is equivalent to finding a point $(\lambda_{ERM}, \beta_{ERM})$ that minimizes the vertical distance to each line:

$$
\begin{aligned}
(\lambda_{ERM}, \beta_{ERM}) &= \arg\min_{\lambda, \beta} \sum_i (y_i - \lambda x_i - \beta)^2 \\
&= \arg\min_{\lambda, \beta, \{\lambda_i\} : \lambda_i = \lambda} \sum_i (y_i - \lambda_i x_i - \beta)^2 \\
&= \arg\min_{\lambda, \beta, \{\lambda_i, \beta_i\} : \lambda_i = \lambda,\ \lambda_i x_i + \beta_i = y_i} \sum_i (\beta_i - \beta)^2 .
\end{aligned}
$$

In contrast, if the probability distribution in parameter space is a Gaussian, FRM involves taking the distance of the entire vector $(\lambda, \beta)$, using the inverse covariance matrix as the metric. For cases where most of the uncertainty is in the slope, as in figure 3b, ERM measures the distance in the vertical direction and FRM measures it almost horizontally, leading to different results.
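To make the contrast between the two distances concrete, here is a minimal sketch for a single data point. It assumes a Gaussian over $(\lambda, \beta)$ with a diagonal covariance and ignores the normalization term of the density, so it illustrates only the per-point distance, not the full FRM objective; the function names and numbers are hypothetical. Minimizing the squared Mahalanobis distance from $(\lambda, \beta)$ to the line $\lambda_i x + \beta_i = y$ with a Lagrange multiplier gives $r^2 / (\sigma_\lambda^2 x^2 + \sigma_\beta^2)$, where $r = y - \lambda x - \beta$ is the usual residual.

```python
def squared_residual(x, y, lam, beta):
    """ERM/MSE view: squared vertical distance, i.e. only the offset may move."""
    return (y - lam * x - beta) ** 2


def functional_distance(x, y, lam, beta, var_lam, var_beta):
    """Smallest squared Mahalanobis distance from (lam, beta) to the line
    {(lam_i, beta_i) : lam_i * x + beta_i = y}, for covariance diag(var_lam, var_beta)."""
    r = y - lam * x - beta
    return r * r / (var_lam * x * x + var_beta)


def closest_parameters(x, y, lam, beta, var_lam, var_beta):
    """The point (lam_i, beta_i) on the line that attains the distance above."""
    r = y - lam * x - beta
    denom = var_lam * x * x + var_beta
    return lam + var_lam * x * r / denom, beta + var_beta * r / denom


lam, beta = 2.0, 0.5
points = [(1.0, 3.0), (4.0, 9.0)]  # both have residual 0.5 under (lam, beta)

for x, y in points:
    erm = squared_residual(x, y, lam, beta)
    frm = functional_distance(x, y, lam, beta, var_lam=1.0, var_beta=0.01)
    lam_i, beta_i = closest_parameters(x, y, lam, beta, var_lam=1.0, var_beta=0.01)
    print(x, erm, frm, lam_i, beta_i)
```

Both points have the same squared residual, yet when most of the uncertainty is in the slope the point with the larger $x$ is much cheaper to explain, since a small change in $\lambda$ moves its prediction further. Setting `var_lam=0` and `var_beta=1` recovers the squared residual, matching the offset-only reading of ERM.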

D.2 VISUALIZATION FOR A SIMPLE FULLY CONVOLUTIONAL NETWORK

Figure 2 shows the difference between MSE and its functional counterpart for a small fully-convolutional network $f_\theta$ mapping images to images. Images $y$ with the same empirical loss $|y - f_\theta(x)|^2$ could require very different functional adaptations to explain: $\min_{\theta' : f_{\theta'}(x) = y} |\theta' - \theta|_{f,L}$. For instance, if a network does edge detection and mistakenly translates its prediction a bit to the right, this small change in functional space could lead to a large error in pixel space. Similarly, if we have a pattern detector and we slightly change its threshold, it could make the entire prediction darker or lighter. Conversely, if we add unstructured noise to our image, we expect it to have a high functional loss, as no small perturbation of the function could simultaneously explain pure noise. That is indeed what we observe in figure 2b when we look for images with high and low functional loss for a fixed empirical loss. Images with high functional loss contain salt-and-pepper-like noise that breaks the smooth pattern of the original image. In contrast, images with low functional loss preserve the overall structure while uniformly shifting large blocks of pixels to a much lighter color. If the noise in our data is better represented by our function class than by noise in the output, we can take this into account to improve learning.



Note that this justifies that there is a large probability mass for $|\theta_i - \theta| \ll 1$, but it does not justify that this is an accurate approximation of the entire integral. However, this is a common and useful approximation.



Figure 1: For many common losses, ERM and FRM can be related to maximum likelihood under simple generative models. Red lines ending in a circle are stochastic, blue arrows are deterministic.

(a) Changes in light and translation naturally cause large, but structured, variations. (b) Images with high and low functional loss for a series of fixed empirical losses, when predicting the edges of the image on the left. The model is a simple two-layer fully-convolutional network. One can see that images with low functional loss retain most of the structure despite having high errors in output (pixel) space.

Figure 2: Functional losses provide a way to capture structured noise, typical in natural settings.

(a) Functional subspaces (lines in this example) that fit each point in the dataset (top plot). Each line is colored according to its datapoint. (b) The best parameter distribution (in green) being quite certain in the offset β, and uncertain in the slope λ.

Figure 3: Functional generative models for a linear function class in house price prediction. Since we only have two parameters, we can plot the function space in 2D on the bottom-left of each sub-figure, with the actual data plotted on the top-right.

Figure 5: Finding the projection of the unknown distribution P(θ) to the family Q θ * (θ) of probability distributions in function space. Here θ * 3 (green) is best.

Figure 6: Minimal functional adaptations using a generalized linear model with Fourier features.

Figure 8: Comparison of the RMSE for ERM and FRM for the learned value function in mountain car under a fixed policy using a temporal difference loss with different features: (left) using a uniform grid of radial basis functions, (right) using a distorted grid of radial basis functions denser in the middle. Solid lines are the average over 20 seeds; shaded areas show the 95th percentile interval.

Figure 9: Accuracies of an MLP trained from latents of two CNN-based VAEs, trained with ERM and FRM. FRM provides small gains in vanilla MNIST, and large gains in all three variants.

BEING SUB-CASES OF FUNCTIONAL LOSSES

B.1 MEAN-SQUARED ERROR AND L1 LOSS AS FUNCTIONAL LOSSES

Let our dataset be $D_{train} = \{(x_i, y_i)\}_{i=1}^n$, $y_i \in \mathbb{R}^1$, and let $L_{MSE} = \frac{1}{n}\sum_{i=1}^n (f(x_i) - y_i)^2$.

Table 1: FRM outperforms ERM in a small CNN environment despite ERM having 0 training loss. Furthermore, the Hessian can be enough to express the dependence on the loss function.

Method   Trained on   Train (positives)   Train (negatives)   Test (positives)   Test (negatives)
ERM      positives    .000                .283 ± .016         .130 ± .006        .278 ± .013
ERM      negatives    .336 ± .020         .000 ± .000         .323 ± .018        .119 ± .005
FRM      positives    .052 ± .002         .109 ± .007         .085 ± .004        .124 ± .006
FRM      negatives    .131 ± .010         .052 ± .002         .136 ± .009        .084 ± .004

