MAXIMUM ENTROPY COMPETES WITH MAXIMUM LIKELIHOOD

Abstract

The maximum entropy (MAXENT) method has a large number of applications in theoretical and applied machine learning, since it provides a convenient nonparametric tool for estimating unknown probabilities. The method is a major contribution of statistical physics to probabilistic inference. However, a systematic approach to its validity limits is currently missing. Here we study MAXENT in a Bayesian decision theory set-up, i.e. assuming that there exists a well-defined prior Dirichlet density for the unknown probabilities, and that the average Kullback-Leibler (KL) distance can be employed for deciding on the quality and applicability of various estimators. This set-up allows us to evaluate the relevance of various MAXENT constraints, check its general applicability, and compare MAXENT with estimators having various degrees of dependence on the prior, viz. the regularized maximum likelihood (ML) and the Bayesian estimators. We show that MAXENT applies in sparse data regimes, but needs specific types of prior information. In particular, MAXENT can outperform the optimally regularized ML provided that there are prior rank correlations between the estimated random quantity and its probabilities.

1. INTRODUCTION

The maximum entropy (MAXENT) method was proposed within statistical physics (Jaynes, 1957; Balian, 2007; Pressé et al., 2013), and later found a wide range of interdisciplinary applications in data science, probabilistic inference, biological data modeling etc.; see e.g. (Erickson & Smith, 2013). MAXENT estimates the unknown probabilities (that generated the data) by maximizing the Boltzmann-Gibbs-Shannon entropy under certain constraints which can be derived from the observed data (Erickson & Smith, 2013). MAXENT leads to non-parametric estimators whose form does not depend on the underlying mechanism that generated the data (i.e. on prior assumptions). Also, MAXENT avoids the zero-probability problem: when operating on sparse data, where certain values of the involved random quantity may not appear due to a small but non-zero probability, MAXENT still provides a controllable non-zero estimate for this small probability. MAXENT has several formal justifications (Jaynes, 1957; Chakrabarti & Chakrabarty, 2005; Baez et al., 2011; Van Campenhout & Cover, 1981; Topsøe, 1979; Shore & Johnson, 1980; Paris & Vencovská, 1997). But the following open problems are basic for MAXENT, because their insufficient understanding prevents its valid application. (i) Which constraints of entropy maximization are to be extracted from data, which is necessarily finite and noisy? (ii) When and how can these constraints lead to overfitting, where, due to noisy data, involving more constraints leads to poorer results? (iii) How do predictions of MAXENT compare with those of other estimators, e.g. the (regularized) maximum likelihood? Here we approach these open problems via tools of Bayesian decision theory (Cox & Hinkley, 1979). We assume that the data is given as an i.i.d. sample of a finite length M from a random quantity with n outcomes and unknown probabilities that are sampled from a non-informative prior Dirichlet density, or a mixture of such densities.
Focusing on the sparse data regime M < n, we calculate average KL-distances between the real probabilities and their estimates, decide on the quality of MAXENT under various constraints, and compare it with the (regularized) maximum-likelihood (ML) estimator. Our main results are that MAXENT does apply to sparse data, but demands specific prior information. We explored two different scenarios of such information. First, the unknown probabilities are most probably deterministic. Second, there are prior rank correlations between the inferred random quantity and its probabilities. Moreover, in the latter case the non-parametric MAXENT estimator is better in terms of the average KL-distance than the optimally regularized (parametric) ML estimator. Some of the above questions were already studied in the literature. Refs. (Good, 1970; Christensen, 1985; Zhu et al., 1997; Pandey & Dukkipati, 2013) applied formal principles of statistics (e.g. the Minimum Description Length) to the selection of constraints (question (i)). Our approach to studying this question will be direct and unambiguous, since, as shown below, Bayesian decision theory leads to clear criteria for the validity of MAXENT estimators. We can also compare all predictions with the optimal Bayesian estimator. The latter is normally not available in practice due to insufficient knowledge of the prior details, but it still provides an important theoretical benchmark. Note that (Thomas, 1979; Lebanon & Lafferty, 2002; Kazama & Tsujii, 2005; Altun & Smola, 2006; Dudik, 2007; Rau, 2011; Campbell, 1999; Friedlander & Gupta, 2005) studied soft constraints that allow the incorporation of prior assumptions into the MAXENT estimator, making it effectively parametric. Here MAXENT will be taken in its original meaning, as providing non-parametric estimators. This paper is organized as follows. Section 2 recalls the tenets of Bayesian decision theory and describes the data-generation set-up.
Section 3 introduces and motivates the Bayesian estimator and the regularized ML estimator. Section 4 recalls the basic formulas of MAXENT, applies them to the studied set-up, and discusses their symmetry features. Section 5 compares the predictions of MAXENT with the regularized ML. We close in the last section by discussing open problems. Appendix A shows how to apply MAXENT to categorical data. Appendix B presents our preliminary results on the affine symmetry of MAXENT estimators, and establishes relations with the minimum entropy principle proposed in (Good, 1970; Christensen, 1985; Zhu et al., 1997; Pandey & Dukkipati, 2013).

2. BAYESIAN DECISION THEORY

Consider a random quantity Z with values (z_1, ..., z_n) and respective probabilities q = (q_1, ..., q_n) = (q(z_1), ..., q(z_n)). We look at an i.i.d. sample of length M:

D = (Z_1, ..., Z_M),   m = {m_k}_{k=1}^n,   M ≡ Σ_{k=1}^n m_k,   (1)

where Z_u ∈ (z_1, ..., z_n) (u = 1, ..., M), and m_k is the number of appearances of z_k in (1). This sample will be an instance of our data; e.g. the constraints of MAXENT will be determined from it. The conditional probability of the data D reads

P(D|q_1, ..., q_n) = P(m_1, ..., m_n|q_1, ..., q_n) = M! ∏_{k=1}^n q_k^{m_k}/m_k!.   (2)

To check the performance of various inference methods, the probabilities q̂(D) = {q̂_k(D)}_{k=1}^n inferred from (1) are compared with the true probabilities q = {q(z_k)}_{k=1}^n via the KL-distance

K[q, q̂(D)] = Σ_{k=1}^n q_k ln(q_k/q̂_k(D)),   (3)

where concrete forms of q̂(D) are given below. The choice of the distance (3) is motivated below, where we recall that it implies the global optimality of the standard (posterior-mean) Bayesian estimator. Another possible choice of distance is the squared (symmetric) Hellinger distance: dist_H[q, q̂] ≡ 1 − Σ_{k=1}^n √(q_k q̂_k). In our situation it frequently leads to the same qualitative results as (3). How can one compare various estimators with each other, and decide on the quality of a given estimator? Bayesian decision theory answers this question; see chapter 11 of (Cox & Hinkley, 1979). The theory assumes that the probabilities of (z_1, ..., z_n) are generated from a known probability density P(q_1, ..., q_n) that encapsulates the prior information about the situation. Next, it decides on the quality of an estimator q̂(D) via the average distance

⟨K⟩ = ∫ ∏_{k=1}^n dq_k P(q_1, ..., q_n) K̄,   K̄ = Σ_D P(D|q) K[q, q̂(D)],   (4)

where K̄ is the average of (3) over samples (1) with fixed length M. Sometimes Bayesian decision theory replaces the distance by a utility, a loss etc. (Cox & Hinkley, 1979).
Note the difference between the proper Bayesian approach and Bayesian decision theory; cf. chapters 10 and 11 in (Cox & Hinkley, 1979). The former employs the data for moving from the prior (5) to the posterior (7); it averages over the posterior density, e.g. when calculating the posterior mean. The latter advises on choosing estimators, whose form may or may not depend on the prior; see below for examples. Decision theory averages both over the data and over the prior, as seen in (4). For the prior density of q = {q_k}_{k=1}^n we choose the Dirichlet density (or a mixture of such densities, as seen below) (Frigyik et al., 2010; Schafer, 1997):

P(q_1, ..., q_n; α_1, ..., α_n) = (Γ[Σ_{k=1}^n α_k] / ∏_{k=1}^n Γ[α_k]) ∏_{k=1}^n q_k^{α_k − 1} δ(Σ_{k=1}^n q_k − 1),   (5)

where Γ[x] = ∫_0^∞ dy y^{x−1} e^{−y} is Euler's Γ-function and the delta-function δ(Σ_{k=1}^n q_k − 1) ensures the normalization of the probabilities. The parameters α_k > 0 determine the prior weight of q_k (Frigyik et al., 2010; Schafer, 1997):

q̄_k ≡ ∫ ∏_{l=1}^n dq_l q_k P(q_1, ..., q_n; α_1, ..., α_n) = α_k / A,   A ≡ Σ_{k=1}^n α_k,   (6)

where the integration goes over the simplex 0 ≤ q_k ≤ 1 for all k, with Σ_{k=1}^n q_k = 1. The Dirichlet density (5) is unique in holding several desired features of a non-informative prior density over unknown probabilities; see (Frigyik et al., 2010; Schafer, 1997) for reviews. An important feature of the density (5) is that it is conjugate to the multinomial conditional probability (2):

P(q_1, ..., q_n | m_1, ..., m_n) = P(q_1, ..., q_n; α_1 + m_1, ..., α_n + m_n).   (7)

Eq. (7) is convenient when studying i.i.d. samples (1) of discrete random quantities. Here we assume that the prior density is known exactly [see however (32)]. In practice, such knowledge need not be available. For example, it may be known that the prior density belongs to the Dirichlet family, but its hyper-parameters {α_k}_{k=1}^n are unknown and should be determined from the data, e.g.
via empirical Bayes procedures; see (Frigyik et al., 2010; Schafer, 1997; Claesen & De Moor, 2015; Ran & Hu, 2017; Bergstra & Bengio, 2012) for reviews on hyper-parameter estimation.
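As a minimal sketch of this set-up (our own illustration, not the paper's code), the average ⟨K⟩ of (4) can be estimated by Monte Carlo: sample q from the Dirichlet prior (5), sample counts m from the multinomial (2), and average the KL-distance (3) for any candidate estimator. The function and variable names below are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(q, q_hat):
    """KL-distance (3): K[q, q_hat] = sum_k q_k ln(q_k / q_hat_k); terms with q_k = 0 vanish."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / q_hat[mask])))

def average_kl(alpha, M, estimator, n_prior=200, n_data=200):
    """Monte-Carlo estimate of <K> in (4): average over the Dirichlet prior (5)
    and over samples (1) of fixed length M."""
    total = 0.0
    for _ in range(n_prior):
        q = rng.dirichlet(alpha)          # true probabilities drawn from the prior
        for _ in range(n_data):
            m = rng.multinomial(M, q)     # counts m_k of the i.i.d. sample, cf. (2)
            total += kl(q, estimator(m, alpha))
    return total / (n_prior * n_data)

# Example estimator: the trivial MAXENT guess (17), q^[0]_k = 1/n.
uniform = lambda m, alpha: np.full(len(m), 1.0 / len(m))
```

For instance, `average_kl(np.full(60, 2.0), M=5, estimator=uniform)` should reproduce, up to Monte-Carlo noise, the value ⟨K[q, q^[0]]⟩ = 0.225 quoted for the setting of Table 1, since for the data-independent estimator (17) the average does not depend on M.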

3. BAYESIAN AND REGULARIZED MAXIMUM LIKELIHOOD (ML) ESTIMATORS

Starting from (4), we find the best estimator in terms of the minimal average KL-distance:

min⟨K⟩ = Σ_D P(D) min { ∫ ∏_{k=1}^n dq_k P(q|D) K[q, q̂(D)] },   (8)

where the minimization goes over the inferred probabilities {q̂(D)}, and where P(q|D) is recovered from P(D|q): P(D) P(q|D) = P(D|q) P(q); cf. (1, 2). The equality in (8) follows from the fact that if q̂(D) minimizes ∫ ∏_{k=1}^n dq_k P(q|D) K[q, q̂(D)], then it minimizes each term of the sum for every D, and thus minimizes the whole sum. Then, implementing the constraint Σ_{k=1}^n q̂_k(D) = 1 via a Lagrange multiplier, we get from (8):

argmin ∫ ∏_{k=1}^n dq_k P(q|D) K[q, q̂(D)] = { ∫ ∏_{k=1}^n dq_k q_l P(q|D) }_{l=1}^n.   (9)

We obtained in (9) the posterior average, because we employed the KL-distance K[q, q̂(D)]. The optimal estimator will be different upon using another distance, e.g. the KL-distance K[q̂(D), q] of q̂(D) from q, or the Hellinger distance. Note that in the proper Bayesian approach the posterior mean is simply postulated to be an estimator, since it is just a characteristic of the posterior distribution. In the present Bayesian decision approach the posterior mean emerges from minimizing a specific (viz. KL) distance. If another distance is used, the posterior mean is no longer optimal. If the prior is a single Dirichlet density (5), we get from (7, 9) for the Bayesian estimator:

p̂(z_k) = (m_k + α_k) / (M + A).   (10)

The average KL-distance (4) for the estimator (10) reads from (7, 2) (denoting ψ[x] ≡ (d/dx) ln Γ[x]):

⟨K[q, p̂]⟩ = (1/A) Σ_{k=1}^n α_k ψ(1 + α_k) − ψ(1 + A) + ln(M + A)
  − (Γ[M + 1] Γ[A] / Γ[M + A + 1]) Σ_{k=1}^n Σ_{m=0}^M (Γ[m + 1 + α_k] Γ[M − m + A − α_k] ln(m + α_k)) / (Γ[α_k] Γ[A − α_k] Γ[m + 1] Γ[M − m + 1]).   (11)
If the prior density is given by a mixture of Dirichlet densities with weights {π_a}_{a=1}^L:

Σ_{a=1}^L π_a P(q_1, ..., q_n; α^[a]_1, ..., α^[a]_n),   Σ_{a=1}^L π_a = 1,   (12)

then instead of (6) and (10) we have from (9)

q̄_k = Σ_{a=1}^L π_a α^[a]_k / A^[a],   A^[a] ≡ Σ_{k=1}^n α^[a]_k,   (13)

p̂(z_k) = [Σ_{a=1}^L π_a Φ^[a] (m_k + α^[a]_k)/(M + A^[a])] / Σ_{a=1}^L π_a Φ^[a],   Φ^[a] ≡ (Γ[A^[a]] / Γ[M + A^[a]]) ∏_{k=1}^n Γ[m_k + α^[a]_k] / Γ[α^[a]_k].   (14)

For a mixture prior density, the Bayesian estimator (14) depends on all the numbers {m_k; α^[1]_k, ..., α^[L]_k}, not just on m_k. Below we illustrate that not knowing precisely the details of the prior mixture can lead to serious losses when applying Bayesian estimators. It is interesting (both conceptually and practically) to have a simple estimator, where the dependence on the prior is reduced to a single parameter. A good candidate is the regularized maximum likelihood (ML) estimator (see (Hausser & Strimmer, 2009) for a review):

p̂_ML(z_k) ≡ (m_k + b)/(M + nb) = λ (m_k/M) + (1 − λ)(1/n),   λ = M/(M + nb),   b ≥ 0,   0 < λ < 1,   (15)

where the regularizer b (or λ) takes care of the fact that for a finite sample (1) not all values z_k had a chance to appear (i.e. m_k = 0 for them). Due to b > 0, Eq. (15) avoids claiming a zero probability. Eq. (15) is a shrinkage estimator, where the proper ML estimator m_k/M is shrunk towards the uniform distribution 1/n by the shrinkage factor λ. The proper ML estimator p̂_ML(z_k)|_{b=0} will be shown to be a meaningless estimator for not very long samples (1), producing results that are worse than {q̂(z_k) = 1/n}_{k=1}^n. Moreover, for such samples the correct choice of b (based on the prior information) is crucial, i.e. (15) is generally a parametric estimator. The estimator (15) recovers the true probabilities for M → ∞ (Cox & Hinkley, 1979), where n and b are fixed, hence λ → 1 in (15). For the optimal estimator (15), the value of b is found by minimizing the average KL-distance (4).
When the prior is given by a single Dirichlet density (5), the average KL-distance again amounts to (11), where we need to replace ln(M + A) → ln(M + nb) and ln(m + α_k) → ln(m + b). Now (9, 10) imply that for a homogeneous Dirichlet prior, i.e. for (5) with α_k = α, we have b_opt = α for the optimal value of b, i.e. the regularized ML estimator coincides with the Bayesian estimator: p̂_ML(z_k) = p̂(z_k). This no longer holds for a mixture of Dirichlet prior densities.
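The estimators (10) and (15) are one-liners; the following sketch (our illustration, with our own variable names) also checks numerically the coincidence b_opt = α stated above for a homogeneous Dirichlet prior.

```python
import numpy as np

def bayes_estimator(m, alpha):
    """Posterior-mean estimator (10): p(z_k) = (m_k + alpha_k) / (M + A)."""
    return (m + alpha) / (m.sum() + alpha.sum())

def regularized_ml(m, b):
    """Regularized ML estimator (15): (m_k + b) / (M + n b)."""
    return (m + b) / (m.sum() + len(m) * b)

m = np.array([3, 0, 2, 0, 0])      # a sparse sample: three values never appeared
alpha = np.full(5, 2.0)            # homogeneous Dirichlet prior, alpha_k = 2

# For a homogeneous prior the two estimators coincide at b = alpha (Section 3):
assert np.allclose(bayes_estimator(m, alpha), regularized_ml(m, b=2.0))
```

Note that both estimators assign non-zero probability to the unobserved values, unlike the proper ML estimator m_k/M obtained at b = 0.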

4. THE MAXIMUM ENTROPY (MAXENT) METHOD

MAXENT infers probabilities by maximizing the Boltzmann-Gibbs-Shannon entropy

S[q] = − Σ_{k=1}^n q(z_k) ln q(z_k),   (16)

under constraints taken from the sample (1). The rationale of maximizing (16) is that a larger S means a smaller bias (or less information), according to several axiomatic schemes (Jaynes, 1957; Chakrabarti & Chakrabarty, 2005; Baez et al., 2011; Van Campenhout & Cover, 1981; Topsøe, 1979; Shore & Johnson, 1980; Paris & Vencovská, 1997; Balian, 2007; Pressé et al., 2013). Note that physical applications of MAXENT operate with constraints that are known precisely; e.g. the mean-energy constraint is deduced from the corresponding conservation law (Jaynes, 1957; Balian, 2007; Pressé et al., 2013). Such situations are rare in statistics and machine learning. Hence we need to understand which constraints are to be taken from the noisy data. First, we can apply no constraints and maximize the entropy:

q^[0](z_k) = 1/n.   (17)

The calculation of the average distance is straightforward from (4, 11, 17), both for a single Dirichlet prior and for a mixture of such priors. We exemplify the single Dirichlet case (5):

⟨K[q, q^[0]]⟩ = ⟨Σ_{k=1}^n q_k ln q_k⟩ + ln n = (1/A) Σ_{k=1}^n α_k ψ(1 + α_k) − ψ(1 + A) + ln n.   (18)

Now ⟨K[q, q^[0]]⟩ plays an important role: since (17) is completely data-independent and simply reproduces the prior expectation of unbiased probabilities, estimators that yield an average KL-distance larger than (18) are meaningless; see below for examples. Next, we employ the empirical mean of (1) as a constraint for the expected value of Z:

μ_1 = (1/M) Σ_{u=1}^M Z_u = (1/M) Σ_{k=1}^n z_k m_k = Σ_{k=1}^n q_k z_k.   (19)

Maximizing (16) under the constraint (19) via the Lagrange method leads to the famous Gibbs formula (Jaynes, 1957; Balian, 2007; Pressé et al., 2013):

q^[1](z_k) = e^{−β z_k} / Σ_{l=1}^n e^{−β z_l},   (20)

where the Lagrange multiplier β is found from (19). Appendix A presents an example of applying (20) to real data. We order the values of Z as z_1 < ...
< z_n, and note a specific feature of (20): depending on the sign of β, we either get q^[1](z_1) ≤ ... ≤ q^[1](z_n) or q^[1](z_1) ≥ ... ≥ q^[1](z_n). One can try to acquire further information from the sample (1) by looking at the second empirical moment:

μ_2 = (1/M) Σ_{u=1}^M Z_u^2 = (1/M) Σ_{k=1}^n z_k^2 m_k = Σ_{k=1}^n q_k z_k^2.   (22)

Now we maximize (16) under the two constraints (19) and (22):

q^[1+2](z_k) = e^{−β_1 z_k − β_2 z_k^2} / Σ_{l=1}^n e^{−β_1 z_l − β_2 z_l^2},   (23)

where the Lagrange multipliers β_1 and β_2 are found from solving both (19) and (22). Eqs. (20, 23) make obvious how to involve other (e.g. fractional) moments. The maximizations of (16) lead to unique results, because (16) is a concave function of {q_k}_{k=1}^n, while the moment constraints are linear. Let the values (z_1, ..., z_n) of Z be subject to the affine transformation

z_k → z'_k = F(z_k),   F(z) = g z + h,   k = 1, ..., n.   (24)

As a result of the transformation (24), μ_1 → μ'_1 = g μ_1 + h and μ_2 → μ'_2 = g^2 μ_2 + h^2 + 2 g h μ_1; see (19, 22). These relations show that (24) leaves the inferred probabilities (20, 23) invariant, because the resulting sets of equations for the unknowns in (20, 23) are identical for the original values (z_1, ..., z_n) and the transformed values (24). Likewise, involving the first p moments μ_1, ..., μ_p produces affine-invariant probabilities. Note that involving only (22) [without involving (19)] leads to invariance of the probabilities with respect to a limited affine symmetry, where h = 0 in (24). Another example of limited affine symmetry is given by the fractional moment Σ_{k=1}^n q_k √z_k (for z_k ≥ 0, instead of (19, 22)). Then the probabilities q^[1/2](z_k) ∝ e^{−β_{1/2} √z_k} stay intact only under h = 0 and g > 0 in (24). Note in this context that the ML estimator (15) is invariant with respect to (24) with an arbitrary bijective F which keeps the values of z_k distinct.

The symmetry features of various estimators are clearly important, though we so far have no analytical results that would relate them to the estimation quality quantified by (4). But we noted from a numerical comparison of MAXENT estimators based on various constraints that estimators with the largest affine symmetry, i.e. (24) with arbitrary g and h, tend to be better in terms of the average KL-distance (4). Intuitively, a higher (affine) symmetry should be related to a lower susceptibility to noise; see Appendix B for further results.
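A minimal implementation of the Gibbs estimator (20) under the mean constraint (19) is sketched below (our own illustration; β is found by bisection, exploiting the fact that the Gibbs mean decreases monotonically in β). The snippet also checks the affine invariance (24) numerically.

```python
import numpy as np

def maxent_mean(z, mu, tol=1e-10):
    """Gibbs estimator (20): q_k ∝ exp(-beta z_k), with beta fixed by the mean
    constraint (19), sum_k q_k z_k = mu. Assumes min(z) < mu < max(z)."""
    z = np.asarray(z, dtype=float)

    def mean(beta):
        w = np.exp(-beta * (z - z.min()))   # shifted exponent for numerical stability
        return float(np.sum(z * w) / np.sum(w))

    lo, hi = -1.0, 1.0
    while mean(lo) < mu:                    # expand bracket until mean(lo) >= mu
        lo *= 2.0
    while mean(hi) > mu:                    # ... and mean(hi) <= mu
        hi *= 2.0
    while hi - lo > tol:                    # bisection: mean(beta) is decreasing
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mean(mid) > mu else (lo, mid)
    w = np.exp(-0.5 * (lo + hi) * (z - z.min()))
    return w / w.sum()

z = np.arange(1, 61)                        # z_k = k, as in the Tables
q1 = maxent_mean(z, mu=10.0)
# Affine invariance (24): z -> g z + h leaves the inferred probabilities intact.
q1_affine = maxent_mean(3.0 * z + 7.0, mu=3.0 * 10.0 + 7.0)
```

Since μ_1 = 10 is below the uniform mean (n + 1)/2 = 30.5, the solver returns β > 0 and hence a monotonically decreasing q^[1], illustrating the ordering property noted after (20).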

5. COMPARING MAXENT WITH THE REGULARIZED ML

5.1. A SINGLE DIRICHLET DENSITY

Recall that maximization of the entropy (16) can be applied if there is no prior information that distinguishes one probability from another. If such information is present, MAXENT is generalized to the minimum relative entropy method (Shore & Johnson, 1980). We shall not study this generalization here. Hence, to ensure the applicability of MAXENT, we always choose prior densities such that q̄_k = 1/n; i.e. all n values are equally likely to be generated, on average. As seen from (6), for a single prior Dirichlet density (5) the condition q̄_k = 1/n implies

α_k = α,   k = 1, ..., n.   (25)

Now recall from (15, 10) that under (25) the Bayesian and the optimally regularized ML estimators coincide. Hence we conclude that the regularized ML is a better estimator than MAXENT (under any constraint). Though q̄_k = 1/n does not depend on α, the most probable values q̃_k of q_k do depend on the magnitude of α. Finding q̃_k from (5, 25) amounts to maximizing

L(q) = (α − 1) Σ_{k=1}^n ln q_k + γ Σ_{k=1}^n q_k,

where the Lagrange multiplier γ ensures Σ_{k=1}^n q_k = 1. For α > 1, L(q) is a concave function of q, and its global maximum is found by differentiation. Hence

q̃_k = 1/n for α > 1,   k = 1, ..., n.   (26)

For α < 1, L(q) is a convex function; it does not have local maxima with q_k > 0 (k = 1, ..., n). Its maxima are located at points where q_k = 0 for certain k. Repeating this argument, we see that the maxima of L(q) are at those points where the largest possible number of q_k are zero:

q̃_k = 0 or q̃_k = 1 for α < 1,   k = 1, ..., n,   (27)

which means deterministic probabilities. Eq. (27) is consistent with q̄_k = 1/n, because there are n equivalent most probable values. Let us start with the regime α > 1; cf. (26). The second constraint (22) is again estimated from the noisy data and gathers more noise than information, which leads to overfitting. This overfitting disappears for larger values of M, i.e. M ≥ 25, as Table 2 demonstrates; then adding the second constraint (22) is beneficial.
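The contrast between (26) and (27) can be seen by sampling from the homogeneous Dirichlet prior: for α < 1 the draws concentrate near the simplex vertices (near-deterministic q), while for α > 1 they concentrate around the uniform point. The parameters n = 5 and α ∈ {0.1, 2} below are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, samples = 5, 2000

# Mean of the largest component of q over many prior draws: close to 1 when the
# prior mass sits near the vertices (alpha < 1, cf. (27)), much smaller when the
# most probable point is the uniform distribution (alpha > 1, cf. (26)).
max_sparse = rng.dirichlet(np.full(n, 0.1), size=samples).max(axis=1).mean()
max_dense  = rng.dirichlet(np.full(n, 2.0), size=samples).max(axis=1).mean()
```

Both priors satisfy q̄_k = 1/n, so the difference lies entirely in where the prior mass concentrates, which is exactly the prior information that (27) supplies to MAXENT.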
Table 1: For n = 60 and z_k = k (k = 1, ..., n) we show the average KL-distance (4) for various estimators. The full affine symmetry (24) holds for all shown probabilities. M is the length of the sample (1). The initial prior Dirichlet density (5) holds (25) with α_k = 2. Eq. (18) equals ⟨K[q, q^[0]]⟩ = 0.225, i.e. values of the average KL-distance larger than 0.225 are meaningless. K_Bayes is the average KL-distance for the Bayes estimator (10), which in this case coincides with the optimally regularized ML estimator. K_1 and K_{1+2} are defined (resp.) via (20, 19) and (23, 22). The averages are found numerically (this applies to all Tables): first we generate 10^2 instances of {q_k}_{k=1}^n from the Dirichlet density, and then for each instance we generate 10^2 samples (1). Such parameters lead to the 3-digit precision reported.

Table 3: The same as in Table 1, but the initial prior density is a Dirichlet mixture given by (12, 30) with α_0 = 0.3 and = 1.1. The average KL-distance ⟨K[q, q^[0]]⟩ for the trivial estimator (17) equals 0.212, i.e. values of the average KL-distance larger than 0.212 are meaningless; cf. (18). K_Bayes and K̃_Bayes refer to (14) and (32), respectively. K_ML|_{b=b_opt} and K_ML|_{b=1} refer to the regularized ML estimator (15) under (resp.) the optimal value of b, found from numerically minimizing ⟨K⟩_ML, and b = 1. The optimal value b_opt of b changes from 2.46 for M = 35 to 2.65 for M = 1. We also report K_ML|_{b=1} with a sensible value of b to confirm that if b is not chosen properly, then the corresponding (regularized) ML estimator (15) is meaningless. K_1 is defined via (4, 20), and K_{1+2} via (23) with constraints (19, 22). The Bayesian estimator is found from (14, 30). (For the larger-M range of Table 4, the MAXENT estimator (23) performs better than (20).)

5.2. A MIXTURE OF TWO DIRICHLET DENSITIES

For modeling more complex types of prior information about the unknown probabilities {q_k}_{k=1}^n, we shall assume that the prior density is a mixture of two Dirichlet densities; see (12).
The relations q̄_k = 1/n (k = 1, ..., n) will still be kept, since they are necessary for applying MAXENT. Now we assume that there are (prior) conditional rank correlations between the values (z_1, ..., z_n) of Z, ordered as z_1 < ... < z_n, and its probabilities (q_1, ..., q_n). For one component of the mixture, the probabilities (q_1, ..., q_n) prefer to be ordered as q_1 < ... < q_n. For the other component they tend to be ordered in the opposite way, q_1 > ... > q_n. This type of prior knowledge can be modeled via a mixture (12) of two Dirichlet priors with L = 2, π_1 = π_2 = 1/2, and

α^[1]_1 < ... < α^[1]_n,   α^[2]_1 > ... > α^[2]_n,   (28)

(α^[1]_k − α^[1]_l)/A^[1] = (α^[2]_l − α^[2]_k)/A^[2]   for any k, l = 1, ..., n,   (29)

where (29) ensures the needed q̄_k = 1/n, as seen from (13). A simple case that leads to (28, 29) is

L = 2,   π_1 = π_2 = 1/2,   α^[1]_k = α_0 + (k − 1),   α^[2]_k = α_0 + (n − k),   k = 1, ..., n,   (30)

where A^[1] = A^[2] = α_0 n + n(n − 1)/2. [Caption notes for Tables 3 and 4: K_{1+2} is not shown, since K_{1+2} > K_1 for 35 ≥ M ≥ 1. We do not show K_1|_{M=1}. K_{1+2} < K_1 < K_ML|_{b=b_opt}.]

Recall that for a mixture of Dirichlet densities the Bayes estimator (14) and the optimally regularized ML estimator (15) are different. For numerical illustration we choose {z_k = k}_{k=1}^n. Prior probability densities generated via (30) will be employed with z_1 < ... < z_n. Now Tables 3 and 4 show that for M ≥ 5 the MAXENT estimator (20, 19) is clearly better than the optimally regularized ML estimator (15):

K_1 < K_ML|_{b=b_opt},   (31)

where the optimal value of b is found from minimizing the average KL-distance (4). Moreover, for M ≥ 7 we see that K_1 is closer to the optimal K_Bayes than to K_ML|_{b=b_opt}. Note that such threshold values of M do depend on the assumed prior density and on n. For M → ∞ the performance of the regularized ML estimator (15) (for a fixed b ∼ 1) will be better than MAXENT with any finite number of constraints, since the regularized ML converges to the true probabilities for M → ∞ (Cox & Hinkley, 1979), while MAXENT does not. But as Table 4 shows, MAXENT with the constraints (19) or (19)+(22) still performs better than the optimally regularized ML even for M as large as 241 (for n = 60). Table 3 shows that MAXENT with the two constraints (19, 22) performs worse than the method under the single constraint (19), although the affine invariance (24) of the probabilities holds. This overfitting situation changes for larger values of M, i.e. M ≥ 45, as Table 4 demonstrates. To stress the relevance of the rank correlations, we note that the advantage (31) of MAXENT closely relates to the agreement between (28) and the ordering z_1 < ... < z_n of Z. If the vector (z_1, ..., z_n) is randomly permuted and employed for the values of Z, the predictions of MAXENT become meaningless even for rather large values of M > n. Recall that both the Bayesian estimator (14) and the regularized ML estimator (15) are parametric, i.e. their very form depends on the prior, which is frequently not available in practice. Hence we need to understand how strong this dependence is.
Let us assume that one has to employ a Bayesian estimator without knowing the full form of the equal-weight mixture (30). Instead, one knows only the average values α̃_k = (1/2)[α_0 + (k − 1)] + (1/2)[α_0 + (n − k)] from (30), prescribes them to a single Dirichlet prior (5, 25), and builds from (10) the estimator

p̃(z_k) = (m_k + α_0 + (n − 1)/2) / (M + n α_0 + n(n − 1)/2).   (32)

The performance of this perturbed Bayesian estimator deteriorates and becomes worse than that of the MAXENT solution: K_1 < K̃_Bayes; see Table 3. Likewise, the choice of b in the regularized ML estimator (15) is important. If just some reasonable value is chosen instead of the optimal one, e.g. b = 1 instead of b ≈ 2.5 in Table 3, then the ML estimator can turn meaningless; see Table 3 for M ≤ 15. In this context, we emphasize that the Bayes estimator and the optimally regularized ML estimator are never meaningless, even for M = 1. For the parameters of Table 3, the MAXENT estimator (20) becomes meaningless for M ≤ 3.
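The mixture Bayesian estimator (14) is numerically delicate because the weights Φ^[a] involve ratios of Γ-functions; the sketch below (our illustration, with our own function names) evaluates them in log-space via the standard-library `lgamma`, using the mixture (30) with assumed parameters n = 8, α_0 = 0.3.

```python
import numpy as np
from math import lgamma

def mixture_bayes(m, alphas, weights):
    """Bayesian estimator (14) for a mixture (12) of Dirichlet priors.
    alphas: list of hyper-parameter arrays alpha^[a]; weights: mixture weights pi_a."""
    m = np.asarray(m, dtype=float)
    M = m.sum()
    log_phi, comps = [], []
    for a, pi in zip(alphas, weights):
        A = a.sum()
        # log(pi_a * Phi^[a]) = log pi_a + lgamma(A) - lgamma(M+A)
        #                       + sum_k [lgamma(m_k + a_k) - lgamma(a_k)]
        lp = np.log(pi) + lgamma(A) - lgamma(M + A) \
             + sum(lgamma(mk + ak) - lgamma(ak) for mk, ak in zip(m, a))
        log_phi.append(lp)
        comps.append((m + a) / (M + A))     # per-component posterior mean, cf. (10)
    w = np.exp(np.array(log_phi) - max(log_phi))   # normalize weights in log-space
    w /= w.sum()
    return sum(wi * ci for wi, ci in zip(w, comps))

n, a0 = 8, 0.3
alpha1 = a0 + np.arange(n)                  # alpha^[1]_k = a0 + (k - 1), cf. (30)
alpha2 = a0 + np.arange(n)[::-1]            # alpha^[2]_k = a0 + (n - k)
m = np.array([0, 0, 1, 0, 2, 1, 3, 5])      # counts favoring the increasing component
p = mixture_bayes(m, [alpha1, alpha2], [0.5, 0.5])
```

For an empty sample (m = 0) the estimator reduces to the prior mean q̄_k = 1/n of (13), while for the skewed sample above the data re-weights the mixture towards the rank-correlated component, so that p(z_n) > p(z_1).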

6. SUMMARY AND DISCUSSION

The maximum entropy (MAXENT) method provides non-parametric estimators for inferring unknown probabilities (Erickson & Smith, 2013). MAXENT is widely applied both in statistical physics and in probabilistic inference. However, its physical applications are mostly data-free and are based on additional principles (e.g. conservation laws (Balian, 2007; Pressé et al., 2013)) that are normally absent in statistics and machine learning. Hence we needed a systematic approach towards understanding the validity limits of MAXENT as an inference tool. Here we presented a Bayesian decision theory approach that allows one to determine whether MAXENT is applicable at all, i.e. whether it is better than a random guess. It also allows one to compare different estimators with each other (e.g. to compare MAXENT with the regularized maximum likelihood), and to study the relevance of various constraints employed in MAXENT. Our results are summarized as follows. MAXENT does apply to sparse data, but demands specific prior information. Here sparse means M < n, i.e. the sample length M is smaller than the number n of probabilities to be inferred. We explored two different scenarios of such prior information. First, the unknown probabilities generated by the homogeneous Dirichlet density (25) are most probably deterministic. Second, there are prior rank correlations between the random quantity and its probabilities. This seems to be the simplest prior information that makes MAXENT applicable and superior to the optimally regularized maximum-likelihood estimator. Our approach is capable of describing several phenomena that are relevant for applying and understanding estimators: overfitting (i.e. adding more noisy constraints leads to poorer inference), instability of optimal Bayesian parametric estimators with respect to variations of prior details, inapplicability of non-parametric MAXENT estimators to very short samples etc.
Several important problems were uncovered by this study and should be addressed in the future. First of all, this concerns the applicability of MAXENT to categorical data, where the values (z_1, ..., z_n) of the random variable Z in the sample (1) are not numerical, but instead refer to certain distinguishable categories. A major difference between the maximum likelihood and MAXENT estimators is that the former freely applies to categorical data. In contrast, MAXENT does depend on the concrete numerical implementation (i.e. encoding) of the data, though this dependence is somewhat weakened by the affine symmetry (24). Thus an open problem demands considering various encoding schemes in view of their applicability to MAXENT estimators. (In this paper we in fact assumed the simplest encoding via natural numbers; see the Tables.) Appendix A reports preliminary results in this direction along with a real data example. The second open problem relates to the influence of affine symmetries on the performance of various MAXENT estimators. We observed numerically that the constraints which produce affine-invariant probabilities produce better estimators; see after (24). Preliminary results along this direction are given in Appendix B, where we also show relations of our results with the minimum entropy principle proposed in (Good, 1970; Christensen, 1985; Zhu et al., 1997; Pandey & Dukkipati, 2013) for constraint selection.

Sampling (q_1, q_2, q_3) from the Dirichlet density with α_1 = α_2 = α_3 = 1, and then using this sample as category probabilities, categorical data sets of size M are generated, with the categories replaced by their frequency encodings. The process is repeated, and the average K_1 from (4) is computed by generating 10^3 instances of {q_k}_{k=1}^3 and 10^3 categorical samples. For the present case α_1 = α_2 = α_3 = 1 and n = 3, we have ⟨K[q, q^[0]]⟩ = 0.265 from (18). Now for M > 15 we get K_1 < ⟨K[q, q^[0]]⟩, i.e. the MAXENT solution is reliable. For example, at M = 17 we have K_Bayes = 0.046 < K_1 = 0.116 < ⟨K[q, q^[0]]⟩ = 0.265, where K_Bayes refers to the Bayesian (posterior-mean) estimator (10). Already for M = 47 the predictions of MAXENT are close to those of the optimal Bayesian estimator: K_Bayes = 0.019 < K_1 = 0.034. For the actual sample size M = 1447, we get even closer results: K_Bayes = 0.00087 < K_1 = 0.014. To summarize the present real categorical data example, we saw that the frequency encoding of the categorical variable allows one to apply MAXENT. The MAXENT estimator (20) agrees with the Bayesian estimator, and becomes reliable already for modest sample sizes M > 15. For sufficiently large M the average KL distance of the MAXENT estimator gets close to that of the (optimal) Bayes estimator.

B APPENDIX: AFFINE SYMMETRY AND THE MINIMUM ENTROPY PRINCIPLE

Above we focused on MAXENT estimators (20, 19) (the first empiric moment is fixed) or (23, 19, 22) (the first and second empiric moments are fixed). As discussed around (24), both ( 19) and ( 22) lead to affine-invariant probabilities. We studied several alternative constraints that do not have the full affine symmetry, i.e. this symmetry is partial and relates to restriction on the parameters in ( 24). An example of this is constraining the square-root (fractional) moment [cf. (20, 23) ] q [1/2] (z k ) = e -β 1/2 √ z k n l=1 e -β 1/2 √ z l , ( ) n k=1 q k √ z k = 1 M M u=1 Z u , where β 1/2 is determined from (34), and where we assumed z k > 0. For estimator (33) the symmetry ( 24) is kept under g > 0 and h = 0. We denote the corresponding average KL distances by K 1/2 . Let us now compare two different MAXENT estimators each one employing its own constraint; e.g. we compare (20) with (33). We saw from extensive numeric simulations that whenever these constraints have different degrees of the affine symmetry, then the estimator having the largest symmetry provides a smaller average KL distance. A particular example of this general relation is: K 1 < K 1/2 , which was verified on parameters of Tables 1 2 3 4 . Recall that Refs. (Good, 1970; Christensen, 1985; Zhu et al., 1997) proposed the minimum entropy principle: when comparing two possible contraints to be employed in the maximum entropy method, then it is preferable to use the one that provides the smaller (maximized) entropy. The heuristic motivation of the principle is that it avoids overfitting by not insisting too much on the entropy maximization. This principle was motivated via the minimum description length in (Pandey & Dukkipati, 2013) . We ask whether in cases similar to (35) we can compare the average entropies, i.e. for (35) we compare S[q [1] (z k )] and S[q [1/2] (z k )] , where the averages are defined as in (4). 
In all cases we were able to check, relations similar to (35) are accompanied by the result that the constraint which provides a smaller average KL distance (i.e. the better constraint) also has a smaller average entropy, e.g.

⟨S[q^[1](z_k)]⟩ < ⟨S[q^[1/2](z_k)]⟩.   (36)

The theoretical origin of this relation between the average KL distance and the average (maximized) entropy is not yet clear. Here is a concrete numerical example that illustrates (35, 36). For the parameters of Table 2 we noted for M = 55: K_1 = 1.756 < K_{1/2} = 1.758 and ⟨S[q^[1](z_k)]⟩ = 4.006 < ⟨S[q^[1/2](z_k)]⟩ = 4.008.
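For a single concrete sample, both estimators entering this comparison reduce to solving a one-dimensional root-finding problem for β. A minimal sketch, with hypothetical values z and empiric frequencies f of our choosing (solver and names are ours): it imposes a single-moment constraint by bisection and evaluates the entropy of each resulting estimator.

```python
import math

def maxent_phi(z, phi, target, lo=-50.0, hi=50.0):
    """MAXENT estimator q_k ∝ exp(-beta * phi(z_k)), with beta fixed by the
    constraint sum_k q_k phi(z_k) = target; the constrained average is
    decreasing in beta, so plain bisection applies."""
    def avg(beta):
        w = [math.exp(-beta * phi(zk)) for zk in z]
        return sum(wk * phi(zk) for wk, zk in zip(w, z)) / sum(w)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if avg(mid) > target:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    w = [math.exp(-beta * phi(zk)) for zk in z]
    s = sum(w)
    return [wk / s for wk in w]

def entropy(q):
    """Boltzmann-Gibbs-Shannon entropy of a probability vector."""
    return -sum(p * math.log(p) for p in q if p > 0)

# hypothetical example: n = 4 outcomes with empiric frequencies f
z = [1.0, 2.0, 3.0, 4.0]
f = [0.1, 0.2, 0.3, 0.4]
q1  = maxent_phi(z, lambda t: t,            # first-moment constraint, cf. (20)
                 sum(fk * zk for fk, zk in zip(f, z)))
q12 = maxent_phi(z, lambda t: math.sqrt(t), # square-root moment, cf. (33, 34)
                 sum(fk * math.sqrt(zk) for fk, zk in zip(f, z)))
print(entropy(q1), entropy(q12))
```

Note that (35, 36) concern averages over the prior, not a single sample, so the entropy ordering printed for one sample need not match (36).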



[Table captions, partially garbled in extraction:]
Table 1: … K_Bayes|_{M=1} = 0.224 < 0.225.
Table 2: The same as in Table 1, but for α_k = α = 0.1 in (25). Eq. (18) gives K[q, q^[0]] = 1.798, i.e. values of the average KL distance larger than 1.798 are meaningless.
Table 3: The same as in Table … since it is larger than the average KL distance for all other estimators.
Table 4: The same as in Table 3, but for different values of M. Here K_{1+2} refers to the MAXENT estimator (23) with constraints (…

A APPENDIX: MAXENT APPLIES TO CATEGORICAL DATA

The MAXENT method can be applied to any multinomial data (1), provided that numeric values (z_1, ..., z_n) of the random quantity Z are given. MAXENT estimators depend on (z_1, ..., z_n) modulo the affine symmetry (24). This creates a problem in applying MAXENT to categorical data, since one now needs a specific encoding of the categories of Z into numeric values (z_1, ..., z_n). Recall that there is a comparable degree of arbitrariness in choosing the regularizer of the maximum likelihood (ML) estimator, or in choosing prior parameters for Bayesian inference; here the arbitrariness lies in the choice of encoding. In practice, the proper encoding of categories arises in any problem dealing with categorical data. If the categories are ordinal (e.g. military ranks, education levels), then one can use an encoding with z_1 < z_2 < ... < z_n. However, for nominal categories (e.g. ethnicity, preference, disease) there is no such ordering.

Let us illustrate the MAXENT method with simple data from a pre-election presidential poll conducted in 1988, where out of M = 1447 voters m_1 = 727 preferred Bush, m_2 = 583 preferred Dukakis, and m_3 = 137 preferred other candidates or had no preference. The data together with its Bayesian analysis is taken from (Gelman et al., 2013). Here our random variable Z is the voter's preference with three outcomes (z_1, z_2, z_3) = ('Bush', 'Dukakis', 'Other') and unknown probabilities (q_1, q_2, q_3), which represent the fractions of the population with each preference. The goal is to estimate q_1 - q_2, i.e. whether Bush has more support than Dukakis. One can assume the noninformative Dirichlet prior distribution for (q_1, q_2, q_3) with parameters α_1 = α_2 = α_3 = 1, compute the posterior means q̂_1 and q̂_2 of q_1 and q_2, and take their difference (Gelman et al., 2013).
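Under the Dirichlet(1, 1, 1) prior the posterior means have the closed form q̂_k = (m_k + 1)/(M + 3). A few lines of Python (our sketch) reproduce the Bayesian point estimates for this data:

```python
# Posterior means under the noninformative Dirichlet(1, 1, 1) prior:
# qhat_k = (m_k + 1) / (M + 3)
m, M = [727, 583, 137], 1447
qhat = [(mk + 1) / (M + 3) for mk in m]
print(qhat)               # ≈ [0.502, 0.403, 0.095]
print(qhat[0] - qhat[1])  # ≈ 0.099 > 0: Bush ahead of Dukakis
```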
The results show that Bush has more support: q̂_1 > q̂_2. Since the data is purely categorical, we apply the frequency encoding for MAXENT: each category is represented by its frequency in the data set, i.e. here (z_1, z_2, z_3) = (0.502, 0.403, 0.095). The empiric mean then equals 0.42, and the maximizing solution of (16) under Σ_{k=1}^{3} q_k z_k = 0.42 is (q^[1](z_1), q^[1](z_2), q^[1](z_3)) = (0.535, 0.36, 0.105). Thus the MAXENT result also shows more support for Bush. To see whether the prediction of MAXENT is reliable (on average) here, the same Bayesian decision model is set up for these samples, where first a sample of (q_1, q_2, q_3) is drawn from the Dirichlet distribution
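The MAXENT estimate just quoted can be reproduced numerically: with the frequency encoding, q^[1](z_k) ∝ e^{-β_1 z_k} and β_1 is fixed by the empiric-mean constraint, which is solvable by bisection since the constrained mean decreases in β. A minimal sketch (solver and names are ours):

```python
import math

def maxent_mean_constraint(z, target, lo=-50.0, hi=50.0):
    """Solve for q_k ∝ exp(-beta * z_k) subject to sum_k q_k z_k = target.
    The constrained mean is decreasing in beta, so bisection applies."""
    def mean(beta):
        w = [math.exp(-beta * zk) for zk in z]
        return sum(wk * zk for wk, zk in zip(w, z)) / sum(w)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(mid) > target:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    w = [math.exp(-beta * zk) for zk in z]
    s = sum(w)
    return beta, [wk / s for wk in w]

# Frequency encoding of the 1988 polling data (m = 727, 583, 137; M = 1447)
m, M = [727, 583, 137], 1447
z = [0.502, 0.403, 0.095]                              # rounded frequencies
target = sum((mk / M) * zk for mk, zk in zip(m, z))    # empiric mean ≈ 0.42
beta, q = maxent_mean_constraint(z, target)
print(q)   # ≈ [0.535, 0.360, 0.105], as in the text
```

The fitted β_1 is negative (≈ -4 here), i.e. the estimator shifts weight toward the larger encoded values.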

