MAXIMUM ENTROPY COMPETES WITH MAXIMUM LIKELIHOOD

Abstract

The maximum entropy (MAXENT) method has a large number of applications in theoretical and applied machine learning, since it provides a convenient nonparametric tool for estimating unknown probabilities. The method is a major contribution of statistical physics to probabilistic inference. However, a systematic approach towards its validity limits is currently missing. Here we study MAXENT in a Bayesian decision theory set-up, i.e. assuming that there exists a well-defined prior Dirichlet density for the unknown probabilities, and that the average Kullback-Leibler (KL) distance can be employed for deciding on the quality and applicability of various estimators. This set-up allows us to evaluate the relevance of various MAXENT constraints, check its general applicability, and compare MAXENT with estimators having various degrees of dependence on the prior, viz. the regularized maximum likelihood (ML) and the Bayesian estimators. We show that MAXENT applies in sparse data regimes, but needs specific types of prior information. In particular, MAXENT can outperform the optimally regularized ML provided that there are prior rank correlations between the estimated random quantity and its probabilities.

1. INTRODUCTION

The maximum entropy (MAXENT) method was proposed within statistical physics (Jaynes, 1957; Balian, 2007; Pressé et al., 2013), and later found a wide range of interdisciplinary applications in data science, probabilistic inference, biological data modeling, etc.; see e.g. (Erickson & Smith, 2013). MAXENT estimates the unknown probabilities that generated the data by maximizing the Boltzmann-Gibbs-Shannon entropy under certain constraints derived from the observed data (Erickson & Smith, 2013). MAXENT leads to non-parametric estimators whose form does not depend on the underlying mechanism that generated the data (i.e. prior assumptions). MAXENT also avoids the zero-probability problem: when operating on sparse data, where certain values of the random quantity may not appear even though they have a small but non-zero probability, MAXENT still provides a controllable non-zero estimate for this small probability. MAXENT has several formal justifications (Jaynes, 1957; Chakrabarti & Chakrabarty, 2005; Baez et al., 2011; Van Campenhout & Cover, 1981; Topsøe, 1979; Shore & Johnson, 1980; Paris & Vencovská, 1997).

However, the following open problems are basic for MAXENT, because their insufficient understanding prevents its valid application. (i) Which constraints for entropy maximization are to be extracted from data, which is necessarily finite and noisy? (ii) When and how can these constraints lead to overfitting, where, due to noisy data, involving more constraints leads to poorer results? (iii) How do predictions of MAXENT compare with those of other estimators, e.g. the (regularized) maximum likelihood?

Here we approach these open problems via tools of Bayesian decision theory (Cox & Hinkley, 1979). We assume that the data is given as an i.i.d. sample of finite length M from a random quantity with n outcomes and unknown probabilities that are drawn from a non-informative prior Dirichlet density, or a mixture of such densities. Focusing on the sparse data regime M < n, we calculate the average KL distance between the true probabilities and their estimates, decide on the quality of MAXENT under various constraints, and compare it with the (regularized) maximum-likelihood (ML) estimator. Our main results are that MAXENT does apply to sparse data, but demands specific prior information. We explored two different scenarios of such information. First, the
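To make the evaluation set-up concrete, the following minimal Python/NumPy sketch (an illustration written for this exposition, not the authors' code; the function name, default parameters, and the choice of add-constant regularization are our assumptions) draws true probabilities from a symmetric Dirichlet prior, generates a sparse i.i.d. sample of length M < n, and compares the Monte-Carlo average KL distance of a regularized ML estimator with that of the unconstrained MAXENT estimate, which is simply the uniform distribution.

import numpy as np

def avg_kl_distance(n=100, M=30, alpha=1.0, reg=0.5, trials=2000, seed=0):
    """Monte-Carlo estimate of the average KL distance D(p || p_hat) between
    true probabilities p ~ Dirichlet(alpha, ..., alpha) and two simple
    estimators built from a sparse i.i.d. sample of length M < n.
    All names and default values here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    kl_ml, kl_uniform = [], []
    for _ in range(trials):
        p = rng.dirichlet(alpha * np.ones(n))        # unknown "true" probabilities
        counts = rng.multinomial(M, p)               # sparse data: many zero counts
        p_ml = (counts + reg) / (M + reg * n)        # regularized (add-constant) ML estimate
        p_unif = np.full(n, 1.0 / n)                 # MAXENT with no constraints -> uniform
        kl_ml.append(np.sum(p * np.log(p / p_ml)))
        kl_uniform.append(np.sum(p * np.log(p / p_unif)))
    return np.mean(kl_ml), np.mean(kl_uniform)

if __name__ == "__main__":
    ml, unif = avg_kl_distance()
    print(f"avg KL, regularized ML: {ml:.4f}; uniform (unconstrained MAXENT): {unif:.4f}")

In this sketch the average KL distance plays the role of the Bayesian risk: averaging D(p || p_hat) over Dirichlet draws of p and over samples corresponds to the criterion used in the paper for deciding between estimators, while data-derived MAXENT constraints would replace the uniform baseline shown here.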

