FUNCTIONAL RISK MINIMIZATION

Abstract

In this work, we break the classic assumption of data coming from a single function f_{θ*}(x) followed by some noise in output space P(y | f_{θ*}(x)). Instead, we model each data point (x_i, y_i) as coming from its own function f_{θ_i}. We show that this model subsumes Empirical Risk Minimization for many common loss functions and captures more realistic noise processes. We derive Functional Risk Minimization (FRM), a general framework for scalable training objectives that results in better performance in supervised, unsupervised, and reinforcement learning experiments. We also show that FRM can be seen as finding the simplest model that memorizes the training data, providing an avenue towards understanding generalization in the over-parameterized regime.

1. INTRODUCTION

1.1. MOTIVATION

In most machine learning settings, we only have limited control over how data is collected, and even less over the process generating it. For this reason, data is often correlated in complex ways, such as data coming from similar times or locations. When these correlations are known, one can handle them appropriately, as is done in frameworks such as multi-task or meta-learning. However, in the absence of obvious reasons to specialize models to subsets of the data, practitioners often take the opposing perspective, where differences in labels belonging to similar inputs are regarded as noise, usually modeled in the output space. This idea serves as the basis for the training objectives we prefer, e.g., the mean-squared error objective for Gaussian noise or the cross-entropy objective for multinomial distributions. By not accounting for highly structured noise, we expect that a single model will appropriately average out noise differences during training.

For instance, consider training a language model on Wikipedia, then fine-tuning it to work on a dataset of books. In doing so, we use two different functions f_{θ_books} and f_{θ_wiki} with f_{θ_books} ≈ f_{θ_wiki}. In contrast, when we train a model on general internet data, using both Wikipedia and the dataset of books, we typically use a single function f_{θ_internet}, and we explain each training example with multinomial noise in output space, i.e., y_i ∼ P(· | f_{θ_internet}(x_i)). However, whether we arrange the data into different datasets or a single one, the datapoints remain the same. It is therefore contradictory to handle the same variability using two different mechanisms: functional diversity vs. output noise.

To remedy this contradiction, this paper proposes to model noise in function space instead of output space. We propose Functional Generative Models (FGMs), where each point (x_i, y_i) comes from its own (unseen) function f_{θ_i}, which fits it exactly: y_i = f_{θ_i}(x_i).
FGMs do not assume the existence of a privileged function f_{θ*}, but instead consider a distribution over functions P(θ); see fig. 1. Most supervised machine learning is based on variants of empirical risk minimization (ERM), which searches for a single function that best fits the training data. There, the training objective acts in output space, comparing the true answer with the prediction. In contrast, assuming that data comes from an FGM, we derive the Functional Risk Minimization (FRM) framework, where training objectives act in function space. Although the full version requires a high-dimensional integral, we derive a reasonable approximation that scales to training neural networks.
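The contrast between the two generative assumptions can be made concrete with a toy sketch. The linear function family, the Gaussian choice of P(θ), and all variable names below are illustrative assumptions for exposition, not the paper's actual model or method:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 100
theta_bar = rng.normal(size=d)   # "average" parameters (mean of P(theta))
sigma = 0.1                      # scale of the variability, in either view

X = rng.normal(size=(n, d))      # inputs x_i

# --- FGM view: each datapoint gets its OWN parameters theta_i ~ P(theta),
#     and is fit exactly by its own function: y_i = f_{theta_i}(x_i).
thetas = theta_bar + sigma * rng.normal(size=(n, d))   # one theta_i per point
y_fgm = np.einsum("ij,ij->i", thetas, X)               # y_i = theta_i . x_i

# --- Classic ERM view: a single privileged theta*, with the same variability
#     instead explained as noise in OUTPUT space: y_i ~ P(. | f_{theta*}(x_i)).
y_erm = X @ theta_bar + sigma * rng.normal(size=n)

# In the FGM, every training point is memorized by its own function,
# while the diversity lives in the spread of the theta_i around theta_bar.
residuals_fgm = y_fgm - np.einsum("ij,ij->i", thetas, X)
print(np.max(np.abs(residuals_fgm)))  # exactly 0: each f_{theta_i} fits (x_i, y_i)
```

The point of the sketch is only the bookkeeping: both views produce datasets with the same kind of variability around X @ theta_bar, but the FGM attributes it to a distribution over functions rather than to noise added after a single function's output.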

Recently, neural networks have been observed to generalize despite memorizing the data, contradicting the classic understanding of ERM (Zhang et al., 2017). Interestingly, we find a connection between FRM and a recent theory explaining this benign overfitting of over-parameterized neural networks under ERM.

