MAXIMUM ENTROPY COMPETES WITH MAXIMUM LIKELIHOOD

Abstract

The maximum entropy (MAXENT) method has a large number of applications in theoretical and applied machine learning, since it provides a convenient non-parametric tool for estimating unknown probabilities. The method is a major contribution of statistical physics to probabilistic inference. However, a systematic approach to its validity limits is currently missing. Here we study MAXENT in a Bayesian decision theory set-up, i.e. assuming that there exists a well-defined prior Dirichlet density for the unknown probabilities, and that the average Kullback-Leibler (KL) distance can be employed for deciding on the quality and applicability of various estimators. This set-up allows us to evaluate the relevance of various MAXENT constraints, check its general applicability, and compare MAXENT with estimators having various degrees of dependence on the prior, viz. the regularized maximum likelihood (ML) and the Bayesian estimators. We show that MAXENT applies in sparse data regimes, but needs specific types of prior information. In particular, MAXENT can outperform the optimally regularized ML provided that there are prior rank correlations between the estimated random quantity and its probabilities.

1. INTRODUCTION

The maximum entropy (MAXENT) method was proposed within statistical physics (Jaynes, 1957; Balian, 2007; Pressé et al., 2013), and later found a wide range of interdisciplinary applications in data science, probabilistic inference, biological data modeling, etc.; see e.g. (Erickson & Smith, 2013). MAXENT estimates the unknown probabilities that generated the data by maximizing the Boltzmann-Gibbs-Shannon entropy under certain constraints derived from the observed data (Erickson & Smith, 2013). MAXENT leads to non-parametric estimators whose form does not depend on the underlying mechanism that generated the data (i.e. on prior assumptions). MAXENT also avoids the zero-probability problem: when operating on sparse data, where certain values of the involved random quantity may not appear owing to a small but non-zero probability, MAXENT still provides a controllable non-zero estimate for this small probability. MAXENT has several formal justifications (Jaynes, 1957; Chakrabarti & Chakrabarty, 2005; Baez et al., 2011; Van Campenhout & Cover, 1981; Topsøe, 1979; Shore & Johnson, 1980; Paris & Vencovská, 1997).

However, the following open problems are basic for MAXENT, because their insufficient understanding prevents its valid application. (i) Which constraints for entropy maximization are to be extracted from the data, which is necessarily finite and noisy? (ii) When and how can these constraints lead to overfitting, where, due to noisy data, involving more constraints leads to poorer results? (iii) How do the predictions of MAXENT compare with those of other estimators, e.g. the (regularized) maximum likelihood? Here we approach these open problems via tools of Bayesian decision theory (Cox & Hinkley, 1979). We assume that the data is given as an i.i.d. sample of finite length M from a random quantity with n outcomes and unknown probabilities drawn from a non-informative prior Dirichlet density, or a mixture of such densities.
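To illustrate the entropy-maximization step in its simplest form, the following sketch (our own, not taken from the paper) solves MAXENT for a single mean constraint \sum_k q_k z_k = \bar{z} together with normalization. The maximizer then has the Gibbs form q_k \propto \exp(-\lambda z_k), and the Lagrange multiplier \lambda can be found by bisection, since the constrained mean is monotone in \lambda:

```python
import numpy as np

def maxent_mean_constraint(z, zbar, tol=1e-12):
    """Maximize the Shannon entropy of q subject to sum(q) = 1 and
    sum(q * z) = zbar. The solution has the Gibbs form
    q_k ∝ exp(-lam * z_k); lam is found by bisection, using that
    the mean under q(lam) decreases monotonically in lam.
    Assumes min(z) < zbar < max(z)."""
    z = np.asarray(z, dtype=float)

    def mean_at(lam):
        w = np.exp(-lam * (z - z.min()))  # shift exponent for stability
        q = w / w.sum()
        return q @ z

    # Bracket the root: small lam -> mean near max(z), large lam -> min(z)
    lo, hi = -1.0, 1.0
    while mean_at(lo) < zbar:
        lo *= 2.0
    while mean_at(hi) > zbar:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_at(mid) > zbar:
            lo = mid   # lam too small: mean still above the target
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = np.exp(-lam * (z - z.min()))
    return w / w.sum()

# When zbar equals the uniform mean, MAXENT recovers the uniform law
q = maxent_mean_constraint(z=[1, 2, 3, 4], zbar=2.5)
print(q)  # ≈ [0.25 0.25 0.25 0.25]
```

Note that every q_k returned is strictly positive, which is precisely how MAXENT avoids the zero-probability problem mentioned above.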
Focusing on the sparse data regime M < n, we calculate average KL distances between the real probabilities and their estimates, decide on the quality of MAXENT under various constraints, and compare it with the (regularized) maximum-likelihood (ML) estimator. Our main results are that MAXENT does apply to sparse data, but demands specific prior information. We explored two different scenarios of such information: first, the unknown probabilities are most probably deterministic; second, there are prior rank correlations between the inferred random quantity and its probabilities. Moreover, in the latter case the non-parametric MAXENT estimator is better, in terms of the average KL distance, than the optimally regularized (parametric) ML estimator.

Some of the above questions have already been studied in the literature. (Good, 1970; Christensen, 1985; Zhu et al., 1997; Pandey & Dukkipati, 2013) applied formal principles of statistics (e.g. the Minimum Description Length) to the selection of constraints (question (i)). Our approach to this question will be direct and unambiguous, since, as shown below, Bayesian decision theory leads to clear criteria for the validity of MAXENT estimators. We can also compare all predictions with the optimal Bayesian estimator. The latter is normally not available in practice, due to insufficient knowledge of prior details, but it still provides an important theoretical benchmark. Note that (Thomas, 1979; Lebanon & Lafferty, 2002; Kazama & Tsujii, 2005; Altun & Smola, 2006; Dudik, 2007; Rau, 2011; Campbell, 1999; Friedlander & Gupta, 2005) studied soft constraints that allow the incorporation of prior assumptions into the MAXENT estimator, making it effectively parametric. Here MAXENT will be taken in its original meaning, as providing non-parametric estimators.

This paper is organized as follows. Section 2 recalls the tenets of Bayesian decision theory and describes the data-generation set-up.
Section 3 introduces and motivates the Bayesian estimator and the regularized ML estimator. Section 4 recalls the basic formulas of MAXENT, applies them to the studied set-up, and discusses their symmetry features. Section 5 compares the predictions of MAXENT with those of the regularized ML. We close in the last section by discussing open problems. Appendix A shows how to apply MAXENT to categorical data. Appendix B presents our preliminary results on the affine symmetry of MAXENT estimators and establishes relations with the minimum entropy principle proposed in (Good, 1970; Christensen, 1985; Zhu et al., 1997; Pandey & Dukkipati, 2013).

2. BAYESIAN DECISION THEORY

Consider a random quantity Z with values (z_1, ..., z_n) and respective probabilities q = (q_1, ..., q_n) = (q(z_1), ..., q(z_n)). We look at an i.i.d. sample of length M:

D = (Z_1, ..., Z_M), \qquad m = \{m_k\}_{k=1}^{n}, \qquad M \equiv \sum_{k=1}^{n} m_k, \qquad (1)

where Z_u \in (z_1, ..., z_n) for u = 1, ..., M, and m_k is the number of appearances of z_k in (1). This sample will be an instance of our data; e.g. the constraints of MAXENT will be determined from it. The conditional probability of the data D reads

P(D|q_1, ..., q_n) = P(m_1, ..., m_n|q_1, ..., q_n) = M! \prod_{k=1}^{n} \frac{q_k^{m_k}}{m_k!}. \qquad (2)

To check the performance of various inference methods, the probabilities \hat{q}(D) = \{\hat{q}_k(D)\}_{k=1}^{n} inferred from (1) are compared with the true probabilities q = \{q(z_k)\}_{k=1}^{n} via the KL distance

K[q, \hat{q}(D)] = \sum_{k=1}^{n} q_k \ln \frac{q_k}{\hat{q}_k(D)}, \qquad (3)

where concrete forms of \hat{q}(D) are given below. The choice of distance (3) is motivated below, where we recall that it implies the global optimality of the standard (posterior-mean) Bayesian estimator. Another possible choice of distance is the squared (symmetric) Hellinger distance: dist_H[q, \hat{q}] \equiv 1 - \sum_{k=1}^{n} \sqrt{q_k \hat{q}_k}. In our situation, it frequently leads to the same qualitative results as (3).

How should one compare various estimators with each other, and decide on the quality of a given estimator? Bayesian decision theory answers this question; see chapter 11 of (Cox & Hinkley, 1979). The theory assumes that the probabilities of (z_1, ..., z_n) are generated from a known probability density P(q_1, ..., q_n) that encapsulates the prior information about the situation. Next, it decides on the quality of an estimator \hat{q}(D) via the average distance

\bar{K} = \int \prod_{k=1}^{n} dq_k \, P(q_1, ..., q_n) \, \langle K \rangle, \qquad \langle K \rangle = \sum_{D} P(D|q) \, K[q, \hat{q}(D)]. \qquad (4)
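The average distance (4) can be estimated numerically. The following Monte Carlo sketch (illustrative only; the values of n, M, the Dirichlet parameter a, and the trial count are our own hypothetical choices, not the paper's) draws q from a symmetric Dirichlet prior, draws counts m from the multinomial likelihood P(D|q), and compares the average KL distance (3) of the posterior-mean Bayesian estimator with that of an additively regularized ML estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, a = 20, 10, 0.2  # sparse regime M < n; symmetric Dirichlet(a) prior

def kl(q, qhat):
    """KL distance (3); outcomes with q_k = 0 contribute nothing."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / qhat[mask])))

def avg_kl(estimators, trials=4000):
    """Monte Carlo version of the average distance (4): draw q from the
    prior, draw counts m from the multinomial likelihood P(D|q), and
    average KL[q, qhat(m)] over both sources of randomness. Common
    random draws are reused across estimators for a fair comparison."""
    totals = np.zeros(len(estimators))
    for _ in range(trials):
        q = rng.dirichlet(np.full(n, a))
        m = rng.multinomial(M, q)
        for i, est in enumerate(estimators):
            totals[i] += kl(q, est(m))
    return totals / trials

# Posterior mean under the true Dirichlet(a) prior (optimal for (4))
bayes = lambda m: (m + a) / (M + n * a)
# Add-beta regularized ML with a mismatched beta = 1 (Laplace smoothing)
add1 = lambda m: (m + 1.0) / (M + n)

risk_bayes, risk_add1 = avg_kl([bayes, add1])
print(f"Bayes risk {risk_bayes:.3f}  vs  add-1 ML risk {risk_add1:.3f}")
```

Since the posterior mean minimizes the conditional expectation of (3) for every data set D, its Monte Carlo risk comes out below that of the mismatched add-one estimator, illustrating the benchmark role of the Bayesian estimator discussed in the Introduction.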

