IT IS LIKELY THAT YOUR LOSS SHOULD BE A LIKELIHOOD

Abstract

Many common loss functions such as mean squared error, cross-entropy, and reconstruction loss are unnecessarily rigid. Under a probabilistic interpretation, these common losses correspond to distributions with fixed shapes and scales. We instead argue for optimizing full likelihoods that include parameters like the normal variance and softmax temperature. Joint optimization of these "likelihood parameters" with model parameters can adaptively tune the scales and shapes of losses in addition to the strength of regularization. We explore and systematically evaluate how to parameterize and apply likelihood parameters for robust modeling, outlier detection, and re-calibration. Additionally, we propose adaptively tuning L2 and L1 weights by fitting the scale parameters of normal and Laplace priors, and we introduce more flexible element-wise regularizers.

1. INTRODUCTION

Choosing the right loss matters. Many common losses arise from likelihoods, such as the squared error loss from the normal distribution, the absolute error loss from the Laplace distribution, and the cross-entropy loss from the softmax distribution. The same is true of regularizers, where L2 arises from a normal prior and L1 from a Laplace prior. Deriving losses from likelihoods recasts the problem as a choice of distribution, which allows data-dependent adaptation. Standard losses and regularizers implicitly fix key distribution parameters, limiting flexibility. For instance, the squared error corresponds to fixing the normal variance at a constant. The full normal likelihood retains its scale parameter and allows optimization over a parameterized set of distributions. This work examines how to jointly optimize distribution and model parameters to select losses and regularizers that encourage generalization, calibration, and robustness to outliers. We explore three key likelihoods: the normal, the softmax, and the robust regression likelihood ρ of Barron (2019). Additionally, we cast adaptive priors in the same light and introduce adaptive regularizers. Our contributions:

1. We systematically survey and evaluate global, data, and predicted likelihood parameters and introduce a new self-tuning variant of the robust adaptive loss ρ.
2. We apply likelihood parameters to create new classes of robust models, outlier detectors, and re-calibrators.
3. We propose adaptive versions of L1 and L2 regularization using parameterized normal and Laplace priors on model parameters.
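To make the prior-regularizer correspondence concrete, here is a minimal NumPy sketch (our own illustrative code and naming, not from the paper) of the negative log-priors whose fixed-scale special cases are the familiar L2 and L1 penalties:

```python
import numpy as np

def normal_prior_nll(theta, sigma):
    """Negative log-density of an isotropic normal prior N(0, sigma^2)
    on the d parameters in theta, dropping constants independent of
    theta and sigma. Fixing sigma recovers an L2 penalty with weight
    1 / (2 * sigma**2), up to an additive constant."""
    d = theta.size
    return np.sum(theta ** 2) / (2.0 * sigma ** 2) + d * np.log(sigma)

def laplace_prior_nll(theta, b):
    """Negative log-density of a Laplace(0, b) prior, dropping constants.
    Fixing b recovers an L1 penalty with weight 1 / b."""
    d = theta.size
    return np.sum(np.abs(theta)) / b + d * np.log(b)
```

Treating sigma or b as a free parameter, with the log-scale terms penalizing overly diffuse priors, is what turns a fixed regularization weight into an adaptively tuned one.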

2. BACKGROUND

Notation We consider a dataset D of points x_i and targets y_i indexed by i ∈ {1, . . . , N}. Targets for regression are real numbers and targets for classification are one-hot vectors. The model f with parameters θ makes predictions ŷ_i = f_θ(x_i). A loss L(ŷ, y) measures the quality of the prediction given the target. To learn model parameters we solve the loss optimization

min_θ E_(x,y)∼D [L(ŷ = f_θ(x), y)].

A likelihood L(ŷ | y, φ) measures the quality of the prediction as a distribution over ŷ given the target y and likelihood parameters φ. We use the negative log-likelihood (NLL) and the likelihood interchangeably since both have the same optima. We define the full likelihood optimization

min_(θ,φ) E_(x,y)∼D [L(ŷ = f_θ(x) | y, φ)]

to jointly learn model and likelihood parameters. "Full" indicates the inclusion of φ, which controls the distribution and the induced NLL loss. We focus on full likelihood optimization in this work. We note that the target y is the only supervision needed to optimize the model and likelihood parameters, θ and φ respectively. Additionally, though the shape and scale vary with φ, reducing the error ŷ - y always reduces the NLL for our distributions.

Distributions Under Investigation This work considers the normal likelihood with scale σ (Bishop et al., 2006; Hastie et al., 2009), the softmax likelihood with temperature τ (Hinton et al., 2015), and the robust likelihood ρ (Barron, 2019), whose parameters α and σ control the shape and scale of the likelihood. The first two are among the most common losses in machine learning, and the last provides an important illustration of a likelihood parameter that affects "shape" instead of "scale". We note that changing the scale and shape of the likelihood distribution is not "cheating", as there is a trade-off between uncertainty and credit. Figure 1 shows how this trade-off affects the normal and softmax distributions and their NLLs.
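As a concrete sketch of full likelihood optimization (illustrative NumPy code with hand-derived gradients; the variable names and setup are ours, not the paper's), consider jointly fitting a constant predictor θ and a log-scale s = log σ under the normal NLL:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=0.5, size=1000)  # synthetic regression targets

theta = 0.0  # model parameter theta: a constant prediction y_hat = theta
s = 0.0      # likelihood parameter phi: s = log(sigma), unconstrained
lr = 0.1

for _ in range(500):
    r = theta - y                # residuals y_hat - y
    inv_var = np.exp(-2.0 * s)   # 1 / sigma^2
    # gradients of the mean NLL = mean(r^2) * inv_var / 2 + s
    g_theta = np.mean(r) * inv_var
    g_s = 1.0 - np.mean(r ** 2) * inv_var
    theta -= lr * g_theta
    s -= lr * g_s
```

At the joint optimum, θ recovers the sample mean and exp(s) the sample standard deviation: the likelihood parameter tunes the scale of the loss while the model parameter fits the data, with only the targets y as supervision.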
The normal likelihood has terms for the residual ŷ - y and the scale σ:

N(ŷ | y, σ) = (2πσ²)^(-1/2) exp(-(ŷ - y)² / (2σ²)),

with σ ∈ (0, ∞) scaling the distribution. After simplifying and omitting constants that do not affect minimization, the normal NLL can be written

-log N(ŷ | y, σ) = (ŷ - y)² / (2σ²) + log σ.

We recover the squared error by substituting σ = 1. The softmax defines a categorical distribution over classes c with scores z:

softmax(ŷ = y | z, τ) = e^(z_y τ) / Σ_c e^(z_c τ),

with the temperature τ ∈ (0, ∞) adjusting the entropy of the distribution. We recover the classification cross-entropy loss, -log p(ŷ = y), by substituting τ = 1 in the respective NLL. We state the gradients of these likelihoods with respect to σ and τ in Section A of the supplement.

Figure 1: Optimizing likelihood parameters adapts the loss without manual hyperparameter tuning to balance accuracy and certainty.

The robust loss ρ and its likelihood are

ρ(x, α, σ) = (|α - 2| / α) [((x/σ)² / |α - 2| + 1)^(α/2) - 1],

p(ŷ | y, α, σ) = (1 / (σ Z(α))) exp(-ρ(ŷ - y, α, σ)),
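The three NLLs under investigation can be sketched as plain functions (a NumPy sketch under our own naming; the α = 2 and α = 0 cases of ρ are the limits of the general expression, handled separately as in Barron (2019)):

```python
import numpy as np

def normal_nll(y_hat, y, sigma):
    """Normal NLL up to additive constants; sigma = 1 recovers
    half the squared error."""
    return 0.5 * ((y_hat - y) / sigma) ** 2 + np.log(sigma)

def softmax_nll(z, y, tau):
    """NLL of the true class index y under a tempered softmax over
    scores z. The temperature multiplies the scores, following the
    convention in the text; tau = 1 recovers cross-entropy."""
    z = tau * np.asarray(z, dtype=float)
    z = z - z.max()  # subtract max for numerical stability
    return -(z[y] - np.log(np.sum(np.exp(z))))

def rho(x, alpha, sigma):
    """General robust loss of Barron (2019); alpha = 2 (scaled squared
    error) and alpha = 0 (Cauchy) are limit cases."""
    x2 = (x / sigma) ** 2
    if alpha == 2.0:
        return 0.5 * x2
    if alpha == 0.0:
        return np.log(0.5 * x2 + 1.0)
    a = abs(alpha - 2.0)
    return (a / alpha) * ((x2 / a + 1.0) ** (alpha / 2.0) - 1.0)
```

For example, rho(x, 1.0, 1.0) evaluates to sqrt(x² + 1) - 1, a smooth-L1-like loss, and decreasing α flattens the tails of the likelihood, which is what makes ρ robust to outliers.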

