IT IS LIKELY THAT YOUR LOSS SHOULD BE A LIKELIHOOD

Abstract

Many common loss functions such as mean-squared error, cross-entropy, and reconstruction loss are unnecessarily rigid. Under a probabilistic interpretation, these common losses correspond to distributions with fixed shapes and scales. We instead argue for optimizing full likelihoods that include parameters like the normal variance and softmax temperature. Joint optimization of these "likelihood parameters" with model parameters can adaptively tune the scales and shapes of losses in addition to the strength of regularization. We explore and systematically evaluate how to parameterize and apply likelihood parameters for robust modeling, outlier detection, and re-calibration. Additionally, we propose adaptively tuning L2 and L1 weights by fitting the scale parameters of normal and Laplace priors, and introduce more flexible element-wise regularizers.

1. INTRODUCTION

Choosing the right loss matters. Many common losses arise from likelihoods, such as the squared error loss from the normal distribution, the absolute error loss from the Laplace distribution, and the cross-entropy loss from the softmax distribution. The same is true of regularizers, where L2 arises from a normal prior and L1 from a Laplace prior. Deriving losses from likelihoods recasts the problem as a choice of distribution, which allows data-dependent adaptation. Standard losses and regularizers implicitly fix key distribution parameters, limiting flexibility. For instance, the squared error corresponds to fixing the normal variance at a constant; the full normal likelihood retains its scale parameter and allows optimization over a parametrized set of distributions. This work examines how to jointly optimize distribution and model parameters to select losses and regularizers that encourage generalization, calibration, and robustness to outliers. We explore three key likelihoods: the normal, the softmax, and the robust regression likelihood ρ of Barron (2019). Additionally, we cast adaptive priors in the same light and introduce adaptive regularizers. Our contributions:

1. We systematically survey and evaluate global, data, and predicted likelihood parameters and introduce a new self-tuning variant of the robust adaptive loss ρ.
2. We apply likelihood parameters to create new classes of robust models, outlier detectors, and re-calibrators.
3. We propose adaptive versions of L1 and L2 regularization using parameterized normal and Laplace priors on model parameters.
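To make the correspondence concrete, the following sketch (illustrative, not from this paper; the function name and parameterization are our own) shows how the normal negative log-likelihood with a free log-variance parameter generalizes the squared error: fixing the variance at 1 recovers the squared error up to a constant factor, while leaving it free lets the loss rescale itself to the residuals.

```python
import numpy as np

def gaussian_nll(y, y_hat, log_var):
    """Negative log-likelihood of y under N(y_hat, exp(log_var)),
    dropping the constant 0.5 * log(2 * pi)."""
    return 0.5 * (np.exp(-log_var) * (y - y_hat) ** 2 + log_var)

y, y_hat = 3.0, 1.0

# Fixing the variance at 1 (log_var = 0) reduces the NLL to half the
# squared error -- the standard regression loss.
assert np.isclose(gaussian_nll(y, y_hat, 0.0), 0.5 * (y - y_hat) ** 2)

# Treating log_var as a free likelihood parameter lets the loss adapt:
# the minimizing variance equals the squared residual, and the adapted
# loss is never worse than the fixed-variance one at that residual.
residual_sq = (y - y_hat) ** 2
adapted = gaussian_nll(y, y_hat, np.log(residual_sq))
assert adapted < gaussian_nll(y, y_hat, 0.0)
```

Large residuals thus incur a smaller penalty once the scale adapts, which is the mechanism behind the robustness and outlier-detection applications discussed later.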

2. BACKGROUND

Notation. We consider a dataset D of points x_i and targets y_i indexed by i ∈ {1, ..., N}. Targets for regression are real numbers and targets for classification are one-hot vectors. The model f with parameters θ makes predictions ŷ_i = f_θ(x_i). A loss L(ŷ, y) measures the quality of the prediction given the target. To learn model parameters we solve the following loss optimization:

min_θ E_{(x,y)∼D} [ L(ŷ = f_θ(x), y) ]

