FUNCTIONAL RISK MINIMIZATION

Abstract

In this work, we break the classic assumption of data coming from a single function f_{θ*}(x) followed by some noise in output space P(y|f_{θ*}(x)). Instead, we model each data point (x_i, y_i) as coming from its own function f_{θ_i}. We show that this model subsumes Empirical Risk Minimization for many common loss functions and captures more realistic noise processes. We derive Functional Risk Minimization (FRM), a general framework for scalable training objectives that results in better performance in supervised, unsupervised, and reinforcement learning experiments. We also show that FRM can be seen as finding the simplest model that memorizes the training data, providing an avenue towards understanding generalization in the over-parameterized regime.

1. INTRODUCTION AND MOTIVATION

In most machine learning settings, we only have limited control over how data is collected, and even less over the process generating it. For this reason, data is often correlated in complex ways, such as data coming from similar times or locations. When these correlations are known, one can handle them appropriately, as is done in frameworks such as multi-task or meta-learning. However, in the absence of obvious reasons to specialize models to subsets of the data, practitioners often take the opposing perspective, where differences in labels belonging to similar inputs are regarded as noise, usually modeled in output space. This idea serves as the basis for the training objectives we prefer, e.g., the mean-squared-error objective for Gaussian noise or the cross-entropy objective for multinomial distributions. By not accounting for highly structured noise, we expect a single model to appropriately average out noise during training.

For instance, consider training a language model on Wikipedia, then fine-tuning it on a dataset of books. In doing so, we use two different functions f_{θ_books} and f_{θ_wiki} with f_{θ_books} ≈ f_{θ_wiki}. In contrast, when we train a model on general internet data, including both Wikipedia and the dataset of books, we typically use a single function f_{θ_internet}, and we explain each training example with multinomial noise in output space, i.e., y_i ∼ P(·|f_{θ_internet}(x_i)). However, whether we arrange the data into different datasets or a single one, the datapoints remain the same. It is therefore contradictory to handle the same variability using two different mechanisms: functional diversity vs. output noise.

To remedy this contradiction, this paper proposes to model noise in function space instead of output space. We propose Functional Generative Models (FGMs), where each point (x_i, y_i) comes from its own (unseen) function f_{θ_i}, which fits it exactly: y_i = f_{θ_i}(x_i).
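As an illustration, the two generative stories can be sketched side by side for a toy linear function class. The specific class, the Gaussian P(θ) centered at θ*, and all constants below are illustrative assumptions, not choices made by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear function class: f_theta(x) = theta[0] * x + theta[1].
def f(theta, x):
    return theta[0] * x + theta[1]

theta_star = np.array([2.0, -1.0])

# Classic assumption: one fixed function plus noise in output space.
def sample_output_noise(x, sigma=0.1):
    return f(theta_star, x) + rng.normal(0.0, sigma)

# Functional Generative Model: each point gets its own theta_i ~ P(theta)
# (here an illustrative Gaussian around theta_star), and the label fits
# that function exactly: y_i = f(theta_i, x_i).
def sample_fgm(x, tau=0.05):
    theta_i = rng.normal(theta_star, tau)
    return f(theta_i, x)

xs = rng.uniform(-1.0, 1.0, size=5)
ys_output_noise = [sample_output_noise(x) for x in xs]
ys_fgm = [sample_fgm(x) for x in xs]
```

Both samplers produce labels scattered around f_{θ*}(x); the difference is where the randomness lives, in the output or in the parameters.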
FGMs do not assume the existence of a privileged function f_{θ*}, but instead consider a distribution over functions P(θ); see fig. 1. Most supervised machine learning is based on variants of empirical risk minimization (ERM), which searches for a single function that best fits the training data. There, the training objective acts in output space, comparing the true answer with the prediction. In contrast, assuming that data comes from an FGM, we derive the Functional Risk Minimization (FRM) framework, where training objectives act in function space. Although the full version requires a high-dimensional integral, we derive a reasonable approximation that scales to training neural networks.

Recently, neural networks have been observed to generalize despite memorizing the data, contradicting the classic understanding of ERM (Zhang et al., 2017). Interestingly, we find a connection between FRM and a recent theory explaining this benign overfitting of over-parameterized neural networks under ERM.

The main contributions of this work are the following:
1. We introduce Functional Generative Models, a simple class of generative models that assigns a function to each datapoint.
2. We derive the Functional Risk Minimization framework, compute a tractable and scalable approximation, and link it to the generalization of over-parameterized neural networks.
3. We provide empirical results showcasing the advantages of FRM in supervised learning, unsupervised learning, and reinforcement learning.

2.1. INFERENCE AND RISK MINIMIZATION

In parametric machine learning, the user specifies a dataset D = {(x_i, y_i)}_{i=1}^n, a parameterized function class f_θ, and a loss function L(y, f_θ(x)). Our goal is to design a learning framework that provides the θ minimizing the expected risk over unseen data: min_θ E_{(x,y)∼P}[L(y, f_θ(x))]. However, since we do not have access to unseen data, we cannot compute this expectation.
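The gap between the two quantities can be made concrete with a toy setup (the linear model class, noise level, and squared-error loss below are illustrative assumptions): the expected risk averages over the data distribution, which we can only approximate by Monte Carlo when the generative process is known, while the empirical risk averages over a fixed training set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data distribution (illustrative): y = x + Gaussian noise,
# model class f_theta(x) = theta * x, squared-error loss.
def f(theta, x):
    return theta * x

def loss(y, y_hat):
    return (y - y_hat) ** 2

def sample_data(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = x + 0.1 * rng.normal(size=n)
    return x, y

# Monte Carlo estimate of the expected risk, using fresh samples from
# the (here known) data distribution.
def estimated_expected_risk(theta, n=100_000):
    x, y = sample_data(n)
    return loss(y, f(theta, x)).mean()

# Empirical risk: the same average restricted to a fixed training set.
def empirical_risk(theta, x_train, y_train):
    return loss(y_train, f(theta, x_train)).mean()

x_train, y_train = sample_data(20)
```

For theta = 1 (the true slope), the expected risk approaches the noise variance 0.01, while the empirical risk on 20 points fluctuates around it.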

Empirical risk minimization (ERM)

In machine learning, we often rely on variants of ERM, where a loss function L evaluated on the given dataset is optimized, i.e., min_θ Σ_{i=1}^n L(y_i, f_θ(x_i)). However, what we want is low expected risk (test loss), not low empirical risk (training loss). In general, the best choice of training objective depends on the loss function L, but also on the (known) function class f_θ and the (unknown) data distribution P(x, y). Often, ERM can be seen as performing maximum likelihood under a very particular noise model for the data, one that makes P(y|x) a function of P(y|f_{θ*}(x)) for some unknown, but fixed, θ*. However, in general, the user-defined loss function L, and thus the optimal θ*, need not have any relation to the data distribution.
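The ERM-as-maximum-likelihood view is easy to verify numerically for the squared-error case: the Gaussian negative log-likelihood with unit variance differs from the (halved) squared error only by an additive constant, so both objectives share the same minimizer. This is a standard identity, shown here as a sketch:

```python
import numpy as np

def half_squared_error(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

def gaussian_nll(y, y_hat, sigma=1.0):
    # -log N(y | y_hat, sigma^2)
    return 0.5 * ((y - y_hat) / sigma) ** 2 + 0.5 * np.log(2 * np.pi * sigma**2)

# With sigma = 1, the NLL equals the halved squared error plus a constant
# independent of y_hat, so minimizing either over theta is equivalent.
y, y_hat = 1.3, 0.7
gap = gaussian_nll(y, y_hat) - half_squared_error(y, y_hat)
```

The analogous statement holds for cross-entropy, which is the negative log-likelihood of a multinomial output distribution.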

Bayesian learning

The Bayesian setting explicitly disentangles inference of P(y|x) from risk minimization of L. However, it usually assumes the existence of a true θ*, and further assumes it comes from some known prior q: θ* ∼ q(·). Then, similar to maximum likelihood, the Bayesian setting often assumes a noise model P(y|f_{θ*}(x)) on the output. Inference about the posterior, P(θ|D) ∝ q(θ) · P(D|θ), thus becomes independent of the loss. Only in the final prediction step is the loss function used, together with the posterior, to find the output with the lowest expected risk.

Relations to FRM

Similar to Bayesian learning, Functional Risk Minimization benefits from a clean distinction between inference and risk minimization. However, FRM assumes fundamental aleatoric noise in function space P(θ), not to be confused with epistemic uncertainty in the Bayesian setting. Similar to ERM, FRM aims to use only a single parameter θ* at test time, which avoids the challenging integration required in the Bayesian setting and its corresponding inefficiencies.
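The contrast between integrating over a posterior and committing to a single θ* can be seen in a conjugate toy model, where the Bayesian integral is available in closed form (the Gaussian prior, unit noise variance, and the data below are illustrative assumptions; for neural networks no such closed form exists):

```python
import numpy as np

# Conjugate example: unknown mean theta with prior N(0, 1) and likelihood
# y_i ~ N(theta, 1). The posterior over theta is Gaussian in closed form.
ys = np.array([0.8, 1.2, 1.0, 0.9])
n = len(ys)

post_var = 1.0 / (1.0 + n)       # 1 / (1/prior_var + n/noise_var)
post_mean = post_var * ys.sum()  # posterior mean of theta

# Bayesian prediction integrates theta out: the predictive variance adds
# the epistemic (posterior) variance to the aleatoric noise variance.
pred_mean = post_mean
pred_var = 1.0 + post_var

# An ERM/FRM-style approach instead commits to a single theta* at test time,
# here the maximum-likelihood point estimate.
theta_star = ys.mean()
```

For neural networks the posterior integral must be approximated at every prediction, which is the inefficiency that FRM's single test-time parameter avoids.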

2.2. RELATED WORK

FGMs essentially treat each individual point as its own task or distribution. In this way, FGMs are related to multi-task learning (Thrun & Pratt, 1998) and meta-learning (Hospedales et al., 2020). Among these, connections between learning-to-learn and hierarchical Bayes are the most relevant (Tenenbaum, 1999; Griffiths et al., 2008; Grant et al., 2018). Implementation-wise, FRM is closer to works looking at distances in parameter space (Nichol et al., 2018) or using implicit



Figure 1: For many common losses, ERM and FRM can be related to maximum likelihood under simple generative models. Red lines ending in a circle are stochastic, blue arrows are deterministic.

