GENERALIZED BELIEF TRANSPORT

Abstract

Human learners have the ability to adopt learning approaches appropriate to constraints such as the prior over hypotheses and the urgency of a decision. However, existing learning models are typically considered individually rather than in relation to one another. To build agents that can move between different modes of learning over time, it is important to understand how learning models are related as points in a broader space of possibilities. We introduce a mathematical framework, Generalized Belief Transport (GBT), that unifies and generalizes prior models, including Bayesian inference, cooperative communication, and classification, as parameterizations of three learning constraints within Unbalanced Optimal Transport (UOT). We visualize the space of learning models encoded by GBT as a cube which includes classic learning models as special points. We derive critical properties of this parameterized space, proving continuity and differentiability, which is the basis for model interpolation, and study the limiting behavior of the parameters, which allows attaching learning models to the boundaries. Moreover, we investigate the long-run behavior of GBT, explore convergence properties of models in GBT mathematically and computationally, and formulate conjectures about general behavior. We conclude with open questions and implications for more unified models of learning.

1. INTRODUCTION

Learning and inference are subject to internal and external constraints. Internal constraints include the availability of relevant prior knowledge, which may be brought to bear on inferences from data. External constraints include the availability of time to accumulate evidence versus the need to make the best decision now. Human learners appear capable of moving between these constraints as necessary. However, standard models of machine learning tend to treat the constraints as different problems, which impedes the development of a unified view of learning agents. Indeed, internal and external constraints on learning map onto classic dichotomies in machine learning. Internal constraints such as the availability of prior knowledge map onto the Frequentist-Bayesian dichotomy, in which the latter uses prior knowledge as a constraint on posterior beliefs while the former does not. Within Bayesian theory, a classic debate pertains to uninformative, or minimally informative, settings of priors (Jeffreys, 1946; Robert et al., 2009). External constraints such as the availability of time to accumulate evidence versus the need to make the best possible decision now inform the use of generative versus discriminative approaches (Ng and Jordan, 2001). Despite the fundamental nature of these debates, and the usefulness of all approaches in the appropriate contexts, we are unaware of prior efforts to unify these perspectives and study the full space of possible models. We introduce Generalized Belief Transport (GBT), based on Unbalanced Optimal Transport (Sec. 2), which parameterizes and interpolates between known reasoning modes (Sec. 3.2), with four major contributions. First, we prove continuity of the parameterization and differentiability on the interior of the parameter space (Sec. 3.1). Second, we analyze the behavior under variations in the parameter space (Sec. 3.3). Third, we consider sequential learning, where learners may (or may not) track the empirically observed data frequencies.
Finally, we state theoretical results, simulations, and conjectures about the sequential behavior for various parameters, for generic costs and priors (Sec. 4.2).

Notation. R_≥0 denotes the non-negative reals, and 1 = (1, . . . , 1). The i-th component of a vector v is v(i). P(A) is the set of probability distributions over A. For a matrix M, M_ij denotes its (i, j)-th entry, M_(i,_) its i-th row, and M_(_,j) its j-th column. Probability is denoted P(·).

2. LEARNING AS A PROBLEM OF UNBALANCED OPTIMAL TRANSPORT

Consider a general learning setting: an agent, which we call a learner, updates their belief about the world based on observed data, subject to constraints. There is a finite set D = {d_1, . . . , d_n} of all possible data, which defines the interface between the learner and the world. The world is defined by a true hypothesis h*, whose meaning is captured by a probability mapping P(d|h*) onto observable data. For instance, the world can be the environment, as in classic Bayesian inference (Murphy, 2012), or a teacher, as in cooperative communication (Wang et al., 2020b). A learner is equipped with a set of hypotheses H = {h_1, . . . , h_m}, which may not contain h*; an initial belief on the hypothesis set, denoted θ_0 ∈ P(H); and a non-negative cost matrix C = (C_ij)_{n×m}, where C_ij measures the underlying cost of mapping d_i into h_j. The cost matrix can be derived from other matrices that record the relation between D and H, such as likelihood matrices in classic Bayesian inference or consistency matrices in cooperative communication (see details in Section 3.2). This setting reflects an agent's learning constraints: pre-selected hypotheses, and the relations between them and the communication interface (the data set). A learner observes data in sequence. At round k, the learner observes a datum d_k that is sampled from D by the world according to P(d|h*). The learner then updates their belief over H from θ_{k−1} to θ_k through a learning scheme, where θ_{k−1}, θ_k ∈ P(H). For instance, in Bayesian inference the learning scheme is defined by Bayes rule, while in discriminative models the learning scheme is prescribed by a code book. The learner transforms the observed data into a belief on hypotheses h ∈ H at minimal cost, subject to appropriate constraints, with the goal of learning the exact map P(d|h*). We can naturally cast this learning problem as Unbalanced Optimal Transport.

2.1. UNBALANCED OPTIMAL TRANSPORT

Unbalanced optimal transport is a generalization of (entropic) Optimal Transport. Optimal transport infers a coupling that minimizes the cost of transporting between two marginal probability distributions (Monge, 1781; Kantorovich, 2006; Villani, 2008). Entropic Optimal Transport adds a regularization term based on the entropy of the inferred coupling, which has desirable computational consequences (Cuturi, 2013; Peyré and Cuturi, 2019). Unbalanced OT further relaxes the problem by allowing one to only approximately match the marginal probability distributions. Let η = (η(1), . . . , η(n)) and θ = (θ(1), . . . , θ(m)) be two probability distributions. A joint distribution matrix P = (P_ij)_{n×m} is called a transport plan or coupling between η and θ if P has η and θ as its marginals. Given a cost matrix C = (C_ij)_{n×m} ∈ (R_≥0)^{n×m}, entropy-regularized optimal transport (EOT) (Cuturi, 2013) solves for the plan P_{ϵ_P} that minimizes the entropy-regularized cost of transporting η into θ. For a parameter ϵ_P > 0:

P_{ϵ_P} = argmin_{P ∈ U(η,θ)} ⟨C, P⟩ − ϵ_P H(P),

where U(η, θ) is the set of all transport plans between η and θ, ⟨C, P⟩ = Σ_{i,j} C_ij P_ij is the inner product between C and P, and H(P) = −Σ_{i,j} (P_ij log P_ij − P_ij) is the entropy of P. Unbalanced Optimal Transport (UOT), introduced by Liero et al. (2018), is a generalization of EOT that relaxes the marginal constraints. The degree of relaxation is controlled by two regularization terms. Formally, for non-negative scalar parameters ϵ = (ϵ_P, ϵ_η, ϵ_θ), the UOT plan is

P_ϵ(C, η, θ) = argmin_{P ∈ (R_≥0)^{n×m}} { ⟨C, P⟩ − ϵ_P H(P) + ϵ_η KL(P1|η) + ϵ_θ KL(P^T 1|θ) }.   (1)

Here KL(a|b) := Σ_i (a_i log(a_i/b_i) − a_i + b_i) is the Kullback-Leibler divergence between non-negative vectors. UOT differs from EOT in relaxing the hard constraint that P satisfy the given marginals η and θ to soft constraints that penalize marginals far from η or θ.
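To make Eq. (1) concrete, the following minimal sketch (NumPy; the function name `uot_objective` and the toy inputs in the note below are our own illustration, not from the paper) evaluates the unbalanced objective for a candidate plan:

```python
import numpy as np

def uot_objective(P, C, eta, theta, eps_P, eps_eta, eps_theta):
    """Value of the objective in Eq. (1) at a candidate plan P (entries > 0)."""
    def kl(a, b):
        # Generalized KL between non-negative vectors: sum a log(a/b) - a + b.
        return float(np.sum(a * np.log(a / b) - a + b))
    entropy = -np.sum(P * np.log(P) - P)   # H(P) = -sum_ij (P_ij log P_ij - P_ij)
    return (np.sum(C * P) - eps_P * entropy
            + eps_eta * kl(P.sum(axis=1), eta)       # row-marginal penalty
            + eps_theta * kl(P.sum(axis=0), theta))  # column-marginal penalty
```

For the independent coupling P = η ⊗ θ both marginal penalties vanish, so the objective value does not depend on ϵ_η and ϵ_θ at that plan.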
In particular, as ϵ_η and ϵ_θ → ∞, we recover the EOT problem.

Proposition 1. The UOT problem with cost matrix C, marginals η, θ and parameters ϵ = (ϵ_P, ϵ_η, ϵ_θ) generates the same UOT plan as the UOT problem with tC, η, θ and tϵ = (tϵ_P, tϵ_η, tϵ_θ) for any t ∈ (0, ∞).

Therefore the analyses of ϵ and tϵ coincide for a general cost C: the objective function in Eq. (1) is linear in C, ϵ_P, ϵ_η, ϵ_θ, so a positive common factor does not affect the minimizer. In discussions under a general cost matrix C, properties that hold for ϵ are thus also valid for all tϵ (t > 0). UOT plans can be solved efficiently via a Sinkhorn-style algorithm (Sinkhorn and Knopp, 1967). Roughly speaking, (η, θ, ϵ)-unbalanced Sinkhorn scaling of a matrix M is an iterated alternation of row and column normalizations of M with respect to (η, θ, ϵ) (see Algorithm 2). Chizat et al. (2018) show that, given a cost C, the UOT plan P_ϵ can be obtained by applying (η, θ, ϵ)-unbalanced Sinkhorn scaling to K_ϵ := e^{−C/ϵ_P} = (e^{−C_ij/ϵ_P})_{n×m}, with convergence rate Õ(mn/ϵ_P) (Pham et al., 2020).

Generalized Belief Transport. Learning, i.e., efficiently transporting one's belief subject to constraints, is naturally a UOT problem. Each round, a learner, defined by a choice of ϵ = (ϵ_P, ϵ_η, ϵ_θ), updates their beliefs as follows. Let η_{k−1}, θ_{k−1} be the learner's estimate of the data distribution and belief over hypotheses H after round k − 1, respectively. At round k, the learner first improves their estimate of the mapping between D and H, denoted M_k, by solving the UOT plan of Eq. (1) with (C, η_{k−1}, θ_{k−1}), i.e., M_k = P_ϵ(C, η_{k−1}, θ_{k−1}). Then, upon observing data d_k, the learner updates their beliefs over H using the corresponding row of M_k: if d_k = d_i for some d_i ∈ D, the learner's belief θ_k is defined to be the normalization of the i-th row of M_k.
Finally, the learner updates their data distribution to η_k by incrementing the i-th element of η_{k−1}; see Algorithm 1.

Algorithm 1 Generalized Belief Transport
  input: C, θ_0, η_0, h*, N, data sampler τ based on P(d|h*), stopping condition ω
  output: M, θ
  initialize: k ← 1
  while k < N and not ω(θ) do
    M ← P_ϵ(C, η_{k−1}, θ_{k−1})
    get data d_i sampled from τ
    η_k ← update(η_{k−1}, d_i) via update rule
    v ← M_(i,_)
    θ_k ← v / Σ_{h∈H} v(h)
    k ← k + 1
  end while

Algorithm 2 Unbalanced Sinkhorn Scaling
  input: C, θ, η, ϵ = (ϵ_P, ϵ_η, ϵ_θ), N, stopping condition ω
  output: P_ϵ(C, η, θ)
  initialize: K = exp(−C/ϵ_P), v^(0) = 1_m, k ← 1
  while k < N and not ω do
    u^(k) ← (η / (K v^(k−1)))^{ϵ_η/(ϵ_η+ϵ_P)},  v^(k) ← (θ / (K^T u^(k)))^{ϵ_θ/(ϵ_θ+ϵ_P)}
    k ← k + 1
  end while
  P_ϵ(C, η, θ) = diag(u) K diag(v)
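Algorithm 2 can be sketched in a few lines of NumPy. This is an illustrative implementation under our own choices (a fixed iteration count in place of the stopping condition ω, and the name `unbalanced_sinkhorn`):

```python
import numpy as np

def unbalanced_sinkhorn(C, eta, theta, eps_P, eps_eta, eps_theta, n_iter=500):
    """Approximate the UOT plan P_eps(C, eta, theta) by (eta, theta, eps)-
    unbalanced Sinkhorn scaling of the Gibbs kernel K = exp(-C / eps_P)."""
    K = np.exp(-C / eps_P)
    u = np.ones(len(eta))
    v = np.ones(len(theta))
    f_eta = eps_eta / (eps_eta + eps_P)          # row-relaxation exponent
    f_theta = eps_theta / (eps_theta + eps_P)    # column-relaxation exponent
    for _ in range(n_iter):
        u = (eta / (K @ v)) ** f_eta             # soft row normalization
        v = (theta / (K.T @ u)) ** f_theta       # soft column normalization
    return u[:, None] * K * v[None, :]           # diag(u) K diag(v)
```

As ϵ_η, ϵ_θ grow large, both exponents approach 1 and the iteration reduces to balanced Sinkhorn scaling, so the plan's marginals approach η and θ, consistent with the EOT limit described above.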

3. GENERALIZED BELIEF TRANSPORT (GBT)

Many learning models with different constraints, including Bayesian inference, Frequentist inference, Cooperative learning, and Discriminative learning, are unified under our GBT framework by varying the choice of ϵ. In this section, we focus on the single-round behavior of the GBT model, i.e., given a pair of marginals (θ, η), how different learners update beliefs. We first visualize the entire learner set as a cube (in terms of parameters); see Figure 1. Then, we study the topological properties of the learner set through continuous deformations of the parameters ϵ. In particular, we show that existing models including Bayesian inference, cooperative inference and discriminative learning are learners with parameters (1, 0, ∞), (1, ∞, ∞) and (0, ∞, ∞), respectively, in our UOT framework.

3.1. THE PARAMETER SPACE OF GBT MODEL

The space of constrained belief-updating learners in GBT is parameterized by the three regularizers of the underlying UOT problem (1): ϵ_P, ϵ_η and ϵ_θ, each ranging in [0, ∞). Therefore the parameter space for GBT is R³_≥0, with the standard topology. When C, θ and η are fixed (assume η ∈ R^n_+), the map ϵ = (ϵ_P, ϵ_η, ϵ_θ) → P_ϵ has the following continuity properties:

Proposition 2. The UOT plan P in Equation (1), as a function of ϵ, is continuous on (0, ∞) × [0, ∞)². Furthermore, P is differentiable with respect to ϵ in the interior.

Continuity in ϵ provides the basis for interpolation between different learning agents. The proof of Proposition 2 also implies continuity in η and θ. Further, towards the boundaries of the parameter space (where theories such as Bayesian inference and Cooperative Communication live), we show:

Proposition 3. For any finite s_P, s_η, s_θ ≥ 0, the limit of P_ϵ exists as ϵ approaches (∞, s_η, s_θ); in fact, lim_{ϵ→(∞,s_η,s_θ)} P_ϵ,ij = 1 for all i, j. Moreover, P_ϵ converges to the solution of min ⟨C, P⟩ − s_P H(P) + s_θ KL(P^T 1|θ) with constraint P1 = η as ϵ → (s_P, ∞, s_θ). Similarly, P_ϵ converges to the solution of min ⟨C, P⟩ − s_P H(P) + s_η KL(P1|η) with constraint P^T 1 = θ as ϵ → (s_P, s_η, ∞). And when ϵ → (s_P, ∞, ∞), P_ϵ converges to the EOT solution of min ⟨C, P⟩ − s_P H(P) with constraints P^T 1 = θ and P1 = η. When ϵ → (∞, ∞, s_θ), (∞, s_η, ∞) or (∞, ∞, ∞), the limit does not exist, but directional limits can be calculated.

The generalized parameter space for UOT with its boundaries is visualized in Fig. 1. The function sigmoid(log(x)) maps [0, ∞) to [0, 1) smoothly, so we can add boundaries to the image cube [0, 1)³. The dashed lines in the figure indicate limits that do not exist. The parameter space is then S = [0, ∞]³ \ ({(∞, ∞, x) : x ∈ [0, ∞]} ∪ {(∞, x, ∞) : x ∈ [0, ∞]}).
Later, we may still mention (∞, ∞, ϵ_θ) and (∞, ϵ_η, ∞), but only for the case where the direction of the limit is vertical (along the ϵ_P-axis).
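The continuity claimed in Proposition 2 can be probed numerically. The sketch below (our own construction: the helper `plan`, the random 4 × 4 problem and the perturbation size are all illustrative) perturbs ϵ slightly and checks that the plan moves only slightly:

```python
import numpy as np

def plan(C, eta, theta, eps_P, eps_eta, eps_theta, n_iter=2000):
    # Unbalanced Sinkhorn scaling (Algorithm 2) with a fixed iteration budget.
    K = np.exp(-C / eps_P)
    u, v = np.ones(len(eta)), np.ones(len(theta))
    for _ in range(n_iter):
        u = (eta / (K @ v)) ** (eps_eta / (eps_eta + eps_P))
        v = (theta / (K.T @ u)) ** (eps_theta / (eps_theta + eps_P))
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(1)
C = rng.random((4, 4))
eta = np.full(4, 0.25)
theta = np.full(4, 0.25)

eps = (1.0, 2.0, 3.0)       # an interior point of the parameter space
delta = 1e-4
P0 = plan(C, eta, theta, *eps)
P1 = plan(C, eta, theta, eps[0] + delta, eps[1] + delta, eps[2] + delta)
# The two plans differ by O(delta), consistent with continuity in eps.
```

This is of course only a sanity check at one interior point, not a substitute for the proof.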

3.2. SOME SPECIAL POINTS IN THE PARAMETER SPACE

Bayesian Inference. Given a data observation, a Bayesian learner (BI) (Murphy, 2012) derives a posterior belief P(h|d) from a prior belief P(h) and a likelihood matrix P(d|h) according to Bayes rule. Intuitively, due to a soft time constraint (ϵ_P = 1), a Bayesian learner is a generative agent who puts a hard internal constraint on their prior belief (ϵ_θ = ∞) and omits the estimated data distribution η in the learning process (ϵ_η = 0). As a direct application of Prop 3, we show:

Corollary 4. Consider a UOT problem with cost C = −log P(d|h) and marginals θ = P(h), η ∈ P(D). The optimal UOT plan P_(1,ϵ_η,ϵ_θ) converges to the posterior P(h|d) as ϵ_η → 0 and ϵ_θ → ∞.

Bayesian inference is thus a special case of GBT with ϵ = (1, 0, ∞). Moreover, by relaxing the constraint on the prior (i.e., 0 < ϵ_θ < ∞), one obtains a parameterized family of less informative priors.

Frequentist Inference. A frequentist updates their belief from data observations by increasing the frequency of the corresponding datum. Intuitively, a frequentist is an agent who puts a hard constraint on the data distribution η (ϵ_η = ∞) and omits prior knowledge θ (ϵ_θ = 0) in a learning process without time constraint (ϵ_P = ∞). Formally we show:

Corollary 5. Consider a UOT problem with θ ∈ P(H) and η = P(d). The optimal UOT plan P_(ϵ_P,∞,0) converges to η ⊗ 1 as ϵ_P → ∞.

Frequentist inference is thus a special case of GBT with ϵ = (∞, ∞, 0).

Cooperative Communication. Two cooperative agents, a teacher and a learner, are considered in Yang et al. (2018); Wang et al. (2020b); Shafto et al. (2021). Cooperative learners (CI) draw inferences about hypotheses based on which data would be most effective for the teacher to choose (see a brief model summary in Appendix A).
Given a data observation, a cooperative learner derives an optimal plan L = P(H, D) based on a prior belief P(h), a shared data distribution P(d), and a matrix M specifying the consistency between data and hypotheses (e.g., M_ij records the co-occurrence of d_i and h_j). Intuitively, a cooperative learner is also a generative agent who puts hard constraints on both data and hypotheses (ϵ_η = ∞, ϵ_θ = ∞) and aims to align with the true belief asymptotically (ϵ_P = 1). As a direct application of Proposition 3 we show:

Corollary 6. Let cost C = −log M and marginals θ = P(h), η = P(d). The optimal UOT plan P_(1,ϵ_η,ϵ_θ) converges to the optimal plan L as ϵ_η → ∞ and ϵ_θ → ∞.

Cooperative inference is thus a special case of GBT with ϵ = (1, ∞, ∞), which is exactly entropic Optimal Transport (Cuturi, 2013).

Discriminative learning. A discriminative learner decodes an uncertain, possibly noise-corrupted, encoded message, which is a natural bridge to information theory (Cover, 1999; Wang et al., 2020b). A discriminative learner builds an optimal map to hypotheses H conditioned on observed data D. The map is perfect when, for all messages, encodings are uniquely and correctly decoded. Intuitively, a discriminative learner aims to quickly build a deterministic code book (implying ϵ_P = 0) that matches the marginals on H and D. Thus, a discriminative learner is GBT with ϵ = (0, ∞, ∞):

Corollary 7. Consider a UOT problem with cost C = −log P(d, h), m = n, and uniform marginals θ = η. The optimal UOT plan P_(ϵ_P,ϵ_η,ϵ_θ) approaches a diagonal matrix as ϵ_η, ϵ_θ → ∞ and ϵ_P → 0.

In particular, the discriminative learner is a special case of GBT with ϵ = (0, ∞, ∞), which is exactly classical Optimal Transport (Villani, 2008). Many other interesting models are unified under the GBT framework as well.
GBT with ϵ = (0, ∞, 0) is the Row Greedy learner widely used in the reinforcement learning community (Sutton and Barto, 2018); ϵ = (∞, ∞, ∞) yields η ⊗ θ, the independent coupling used in χ² tests (Fienberg et al., 1970); ϵ = (ϵ_P, ϵ_η, ∞) is used for adaptive color transfer (Rabin et al., 2014); and ϵ = (0, ϵ_η, ϵ_θ) is UOT without the entropy regularizer, developed in Chapel et al. (2021). Other points in the GBT parameter space are likely of interest as well.
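Corollary 4 lends itself to a quick numerical check. The sketch below (toy likelihood and prior are our own; ϵ = (1, 0, ∞) is approximated by a large finite ϵ_θ) runs unbalanced Sinkhorn scaling and compares the row-normalized plan with the Bayes posterior:

```python
import numpy as np

# Toy likelihood: column j is P(d | h_j), with columns summing to 1.
likelihood = np.array([[0.7, 0.2],
                       [0.2, 0.3],
                       [0.1, 0.5]])
prior = np.array([0.6, 0.4])               # theta = P(h)
eta = np.full(3, 1.0 / 3.0)                # ignored when eps_eta = 0

C = -np.log(likelihood)                    # cost of Corollary 4
eps_P, eps_eta, eps_theta = 1.0, 0.0, 1e9  # approximates (1, 0, inf)

K = np.exp(-C / eps_P)
u, v = np.ones(3), np.ones(2)
for _ in range(100):                       # unbalanced Sinkhorn scaling
    u = (eta / (K @ v)) ** (eps_eta / (eps_eta + eps_P))   # exponent 0: u = 1
    v = (prior / (K.T @ u)) ** (eps_theta / (eps_theta + eps_P))
P = u[:, None] * K * v[None, :]

i = 0                                      # observe datum d_1
gbt_posterior = P[i] / P[i].sum()
bayes_posterior = likelihood[i] * prior / (likelihood[i] * prior).sum()
```

Both vectors agree up to the finite-ϵ_θ approximation, recovering the Bayes posterior P(h|d_1).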

3.3. GENERAL PROPERTIES ON THE TRANSPORTATION PLANS

The general GBT framework builds a connection between the above theories, and the behavior varies with the parameters. In particular, each component of ϵ = (ϵ_P, ϵ_η, ϵ_θ) expresses a different constraint on the learner. Given (C, θ, η) as shown in the top-left corner of Fig. 1, we plot each learner's UOT plan, with darker color representing larger elements.

ϵ_P controls a learner's learning horizon. When ϵ_P → 0, agents are under time pressure to make an immediate decision, so GBT converges to a discriminative learner, or the Row Greedy learner, on the bottom of the cube (Fig. 1). Their UOT plans have a clear leading diagonal, which allows fast decisions: most of the time, one datum is enough to identify the true hypothesis, and convergence is achieved after each data observation. When ϵ_P → ∞, GBT converges to a reticent learner, such as the learners on the top of the cube: data do not constrain the true hypothesis, and learners draw their conclusions independent of the data. In between, GBT provides a generative (probabilistic) learner. When ϵ_P = 1, we have the Bayesian and Cooperative learners, for whom data accumulate to identify the true hypothesis in a manner broadly consistent with probabilistic inference, and consistency is asymptotic.

ϵ_η controls a learner's knowledge of the data distribution η. When ϵ_η → ∞, GBT converges to a learner who is aware of the data distribution and reasons about observed data according to the probabilities/costs of possible outcomes; examples include the Discriminative and Cooperative learners on the front of the cube. When ϵ_η → 0, GBT converges to a learner who updates their belief without taking η into consideration, such as the Bayesian learners on the back of the cube, and the Tyrant, who cares about neither data nor cost and cannot be changed by anything.

ϵ_θ controls the strength of the learner's prior knowledge.
When ϵ_θ → ∞, GBT converges to a learner who enforces a prior over the hypotheses, such as the Bayesian, Cooperative and Discriminative learners on the right of the cube; indeed, GBT follows Bayes rule when ϵ_θ = ∞ (Prop 8). When ϵ_θ → 0, GBT converges to learners who use no prior knowledge: they do not maintain beliefs over H, and draw their conclusions purely from the data distribution, such as the Frequentist learner η ⊗ 1 on the left of the cube.

Proposition 8. In GBT with ϵ_θ = ∞, cost C and current belief θ, the learner's update of θ via the UOT plan coincides with applying Bayes rule with prior θ and likelihood derived from P_ϵ(C, η, θ).
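The role of ϵ_P can be seen in a small symmetric example (our own construction, with large finite values standing in for ∞): with ϵ_η = ϵ_θ large, a tiny ϵ_P yields a near-deterministic diagonal plan (the discriminative corner), while a large ϵ_P pushes the plan toward the independent coupling η ⊗ θ:

```python
import numpy as np

def plan(C, eta, theta, eps_P, eps_eta, eps_theta, n_iter=1000):
    # Unbalanced Sinkhorn scaling (Algorithm 2) with a fixed iteration budget.
    K = np.exp(-C / eps_P)
    u, v = np.ones(len(eta)), np.ones(len(theta))
    for _ in range(n_iter):
        u = (eta / (K @ v)) ** (eps_eta / (eps_eta + eps_P))
        v = (theta / (K.T @ u)) ** (eps_theta / (eps_theta + eps_P))
    return u[:, None] * K * v[None, :]

C = 1.0 - np.eye(3)                 # matching d_i to h_i is cheapest
eta = theta = np.full(3, 1.0 / 3.0)

P_fast = plan(C, eta, theta, eps_P=0.01, eps_eta=1e6, eps_theta=1e6)
P_slow = plan(C, eta, theta, eps_P=100.0, eps_eta=1e6, eps_theta=1e6)
# P_fast is nearly diag(1/3, 1/3, 1/3): an immediate, code-book-like decision.
# P_slow is nearly the outer product eta (x) theta: data barely constrain h.
```

The contrast illustrates the vertical axis of the cube: decisiveness at the bottom, reticence at the top.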

4. SEQUENTIAL GBT: ASYMPTOTIC BEHAVIOR

One interesting difference between the one-shot case considered above and the sequential case is the possibility of observing many data points. In addition to the learning models in the GBT parameter space, in this section, we consider whether the learner's marginal on the data is fixed a priori, or accumulates evidence based on experience.

4.1. BASICS

The sequential GBT model consists of a teacher and a learner. The teacher samples data from a probability distribution η (not necessarily related to some h ∈ H), and the learner follows GBT with cost C and parameter ϵ. The learner starts with a prior θ_0 and, in each round k, applies GBT with η_{k−1} and θ_{k−1} to generate θ_k through the UOT solution M_k. In the Preliminary sequential model (PS), we assume η_k = η for all k. In practice, however, a learner does not have access to η = P(d|h*). Instead, in each round the learner may choose to use the current empirical distribution of the data as an estimate of η, i.e., η_k(d) ∝ |{i : i < k, d_i = d}| + n_0(d) according to the observed data sequence, where n_0(d) > 0 (e.g., 1, as in add-one smoothing (Murphy, 2012)) is a prior count that avoids zeros in η. This gives the Real sequential model (RS), where η_k → η almost surely. It is easy to see that the sequence of posteriors forms a time-homogeneous Markov chain on P(H). In statistics, a model is said to be consistent (strongly consistent) when, for every fixed hypothesis h ∈ H, the model's belief θ over the hypothesis set H converges to δ_h in probability (almost surely) as more and more data are sampled from η = P(d|h), the θ's being considered random variables. Consistency has been well studied for Bayesian inference since Bernstein, von Mises and Doob (Doob, 1949), and recently demonstrated for Cooperative Communication (Wang et al., 2020a). The challenging problem arises when one tries to learn an h* that is not contained in the pre-selected hypothesis space H: it is not clear which h ∈ H is the 'correct' target to converge to, so consistency does not fit the situation in sequential GBT. For sequential GBT models, we state properties directly in the language of the posterior sequence (Θ_k)_{k=1}^∞ as random variables, and name them where necessary.
We focus on whether the sequence converges (and in which sense), and how conclusive it is, i.e., how likely it is to provide a stable, fixed h ∈ H as the result. We provide some theoretical conclusions, and fill the gaps with empirical results and conjectures.
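The (RS) loop of Algorithm 1 can be sketched directly. The toy likelihood, seed, round count and the Bayesian-corner parameter choice ϵ = (1, 0, ∞) (approximated by a large finite ϵ_θ) below are our own illustrative choices; with data drawn from a hypothesis inside H, the posterior sequence concentrates on it:

```python
import numpy as np

def uot_plan(C, eta, theta, eps_P, eps_eta, eps_theta, n_iter=50):
    # Unbalanced Sinkhorn scaling (Algorithm 2), fixed iteration budget.
    K = np.exp(-C / eps_P)
    u, v = np.ones(len(eta)), np.ones(len(theta))
    for _ in range(n_iter):
        u = (eta / (K @ v)) ** (eps_eta / (eps_eta + eps_P))
        v = (theta / (K.T @ u)) ** (eps_theta / (eps_theta + eps_P))
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
likelihood = np.array([[0.8, 0.1],       # column j is P(d | h_j); toy world
                       [0.1, 0.8],
                       [0.1, 0.1]])
C = -np.log(likelihood)
eps = (1.0, 0.0, 1e9)                    # Bayesian corner (1, 0, inf), approx.
theta = np.array([0.5, 0.5])             # initial belief theta_0
counts = np.ones(3)                      # add-one smoothing for eta_k (RS)
h_star = 0                               # data are sampled from h_1

for k in range(200):
    eta_k = counts / counts.sum()        # empirical estimate of eta
    M = uot_plan(C, eta_k, theta, *eps)
    i = rng.choice(3, p=likelihood[:, h_star])   # observe d_i ~ P(d | h*)
    theta = M[i] / M[i].sum()                    # normalize the observed row
    counts[i] += 1                               # update the data counts
```

After a couple hundred rounds the belief is essentially δ_{h_1}, matching Theorem 9's almost-sure convergence for this parameter regime.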

4.2. RESULTS AND CONJECTURES

Results in this section are stated for different values of ϵ_θ. By Prop. 1, we may focus on ϵ_P = 1 for a generic cost matrix C, since the general (ϵ_P, ϵ_η, ϵ_θ) case reduces to the (1, ϵ_η/ϵ_P, ϵ_θ/ϵ_P) case. So we choose ϵ_P = 1 in simulations.

ϵ_θ = ∞: Conclusive and Bayesian-style. These learners are located on the right side of Fig. 1, and include many well-studied learners: Bayesian, Cooperative, Discriminative, Row Greedy, etc. According to Prop 8, learners in this class perform "Bayesian"-style learning. There are two theoretical results: ϵ_η = 0 (Bayesian) and ϵ_η = ∞ (the SCBI learner (Wang et al., 2020a)). Others are explored in simulations.

Theorem 9 ((Doob, 1949), (Wang et al., 2020a)). In the GBT sequential model (both (PS) and (RS)) with ϵ = (ϵ_P, 0, ∞), where ϵ_P ∈ (0, ∞), the sequence Θ_k converges to some δ_h almost surely, where h corresponds to the column of e^{−C/ϵ_P} closest to η in the sense of KL-divergence, provided θ is positive in each entry.

When ϵ_η = ϵ_θ = ∞, the models (PS) and (RS) behave slightly differently.

Lemma 10. For ϵ = (ϵ_P, ∞, ∞), ϵ_P ∈ (0, ∞), given cost C, initial belief θ_0 ∈ P(H) and fixed teaching and learning distribution η_k = η ∈ P(D) for all k, i.e., model (PS), the belief random variables (Θ_k)_{k∈N} have the same expectation on each h: E[Θ_k(h)] = θ_0(h).

Thus a GBT learner with access to the data distribution and strict marginal constraints converges to a distribution on D equal to η with probability 1. Moreover, the probability that column h is shaped into η is determined by the prior θ_0. That is, GBT learners converge to the truth by changing one of their original hypotheses into the true hypothesis. For the (RS) model the result is similar, but Lemma 10 fails to hold:

Proposition 13. Consider a learning problem with cost C, initial belief θ_0 ∈ P(H), and the true hypothesis h* defined by η ∈ P(D).
For the (RS) problem, the belief random variables (Θ_k)_{k∈N} satisfy, for any s > 0, lim_{k→∞} Σ_{h∈H} P(Θ_k(h) > 1 − s) = 1. As a consequence, the transport plan M_k has a dominant column (h_j) with total weight > 1 − s, and |(M_k)_ij − η_k(i)| < s.

In fact, as long as the sequence of η_k as random variables converges to η in probability, the above proposition holds. The limit lim_{k→∞} Σ_{h∈H} P(Θ_k(h) > 1 − s) measures how conclusive the model is. In contrast with standard Bayesian or other inductive learners, Proposition 13 shows that a GBT learner is able to learn any hypothesis mapping η = P(d|h*) up to a given threshold s with probability 1. In addition to unifying disparate models of learning, GBT enables a fundamentally more powerful approach to learning by empirically monitoring the data marginal.

Fig. 2 illustrates convergence over learning problems and episodes. For each bar, we sample 100 learning problems (C, θ_0, h*) from a Dirichlet distribution with hyperparameter vector 1. Then we sample 1000 data sequences (episodes) of maximal length N = 10000. The learner learns with Algorithm 1, where the stopping condition ω is max_{h∈H} θ(h) > 1 − s with s = 0.001. The y-axis in the plots represents the percentage of episodes converged; the color indicates in how many rounds an episode converges. For instance, in the bar corresponding to '10 × 10_update_uot', with 10 data points (yellow portion), about 50% of episodes satisfy the stopping condition. In Figure 2, the first plot shows results for 10 × 10 and 5 × 3 matrices. The second plot shows results for rectangular matrices of dimension m × 10 with m ranging over 5, 10, 25, 50, 100. The third plot shows results for square matrices of dimension m × m with m ranging over 10, 25, 50, 100. Here 'exact' and 'update' indicate whether the problem is (PS) or (RS), respectively. For parameters, 'uot' represents the choice (ϵ_P = 1, ϵ_θ = ϵ_η = 40) vs.
'ot', which represents the choice (ϵ_P = 1, ϵ_θ = ϵ_η = ∞). The first plot illustrates that learners who do not have access to the true hypothesis (and empirically build an estimate of η) learn faster than learners who have full access. The second plot indicates that, with a fixed number of hypotheses, learning is faster when the dimension of D increases. The third plot shows that the GBT learner scales well with the dimension of the problem.

In Fig. 3, we study the asymptotic behavior of the learners corresponding to ϵ = (1, ϵ_η, ∞), with ϵ_η ∈ {0, 0.02, 0.2, 0.5, 1, 2, 5, 50, ∞}. We sample a learning problem of dimension 5 × 5 from a Dirichlet distribution with hyperparameter vector 1. Each learner ϵ = (1, ϵ_η, ∞) is equipped with fixed C, θ_0 and η_k = η for all k. We run 400,000 learning episodes per learner, and plot their convergence summary in the bar graph. A continuous transition from a Bayesian learner to a cooperative learner can be observed empirically: the coefficients a(h) of the limit, written as Σ_{h∈H} a(h) δ_h, change from δ_{h_3} (by Bernstein-von Mises) to θ_0(h) (by Theorem 11). From these empirical results, we state the following conjecture:

Conjecture 14. When ϵ = (ϵ_P, ϵ_η, ∞) with ϵ_P ∈ (0, ∞), the sequence of posteriors Θ_k from generic C, η, θ and ϵ, as random variables, satisfies lim_{k→∞} Σ_{h∈H} P(|Θ_k(h) − 1| < e) = 1 for any e > 0.

We further report an empirical property observed in simulation, which suggests a possible rate of convergence. For given C, θ_0 and η, fix ϵ_P = 1 and ϵ_θ = ∞; as ϵ_η varies from 0 to ∞, we pick out the episodes with θ_N(h) > 0.95 and plot the values E_{θ_N(h)>0.95}[ln θ_k(h) − ln(1 − θ_k(h))] for each h against k (Fig. 4, bottom). Near-linear relations are observed away from the first several rounds and before the values reach the precision threshold. These are empirical estimates of the rate of convergence.
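Lemma 10's constant-expectation (martingale) property can also be verified for a single (PS) step without any sampling, because the plan's row sums equal η and its column sums equal θ_0. The sketch below (our own toy problem, with large finite values approximating ϵ = (1, ∞, ∞)) computes E[Θ_1(h)] = Σ_i η(i) · (row i of M, normalized) and recovers θ_0:

```python
import numpy as np

def uot_plan(C, eta, theta, eps_P, eps_eta, eps_theta, n_iter=2000):
    # Unbalanced Sinkhorn scaling (Algorithm 2), fixed iteration budget.
    K = np.exp(-C / eps_P)
    u, v = np.ones(len(eta)), np.ones(len(theta))
    for _ in range(n_iter):
        u = (eta / (K @ v)) ** (eps_eta / (eps_eta + eps_P))
        v = (theta / (K.T @ u)) ** (eps_theta / (eps_theta + eps_P))
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(2)
C = rng.random((4, 3))
eta = np.array([0.1, 0.2, 0.3, 0.4])
theta0 = np.array([0.2, 0.3, 0.5])

M = uot_plan(C, eta, theta0, 1.0, 1e8, 1e8)    # approximates eps = (1, inf, inf)
rows = M / M.sum(axis=1, keepdims=True)        # belief Theta_1 given each d_i
expected_theta1 = eta @ rows                   # E[Theta_1] = sum_i eta(i) rows_i
```

Since the i-th row sums to η(i), the η(i) factors cancel and the expectation collapses to the column sums of M, which are θ_0; this is the one-step version of the martingale property.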
There is a special case on the boundary, the Independent Coupling (∞, ∞, ∞), whose limit is taken vertically along the ϵ_P-axis (see Sec. 3.1). The Independent Coupling has a fixed posterior, Law(Θ_k) = δ_{θ_0}, since the normalization of each row of P_(∞,∞,∞) is θ_0.

ϵ_θ = 0: Inconclusive and independent. The following holds for both (PS) and (RS):

Proposition 15. For ϵ = (ϵ_P, ϵ_η, 0) with ϵ_P ∈ (0, ∞), as η_k → η almost surely, the sequence Θ_k of posteriors, as a sequence of random variables, converges in probability to a random variable Θ.

With ϵ_θ = 0, the constraint on the column sums (the ϵ_θ-term) does not affect the transport plan, so the Θ_k's in the sequence are independent of each other, in contrast to all other cases, where adjacent posteriors are correlated via a nondegenerate transition distribution. The independence makes the sequence of posterior samples within one episode behave totally randomly, so it rarely converges as points in P(H). Furthermore, when considering the natural coupling (Θ_{k−1}, Θ_k) from the Markov transition measure for ϵ_θ = 0 (which is independent), E|Θ_{k−1} − Θ_k|² converges to the variance Var(η). In contrast, for ϵ_θ = ∞, E|Θ_{k−1} − Θ_k|² converges to 0 if Conj. 14 holds.

ϵ_θ ∈ (0, ∞): Partially conclusive. From Conj. 14 and Prop. 15, together with the continuity of the transition distribution in ϵ, we conjecture the following continuity of conclusiveness when ϵ_P ∈ (0, ∞).

Conjecture 16. For both the (PS) and (RS) models, when ϵ = (ϵ_P, ϵ_η, ϵ_θ) with ϵ_P, ϵ_θ ∈ (0, ∞), the posterior sequence Θ_k generated from generic C, η, θ and ϵ satisfies that lim_{k→∞} Σ_{h∈H} P(|Θ_k(h) − 1| < s) = L exists, with L ∈ (0, 1), for any s > 0.
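The independence underlying Proposition 15 is visible directly in the scaling iteration: with ϵ_θ = 0 the column exponent ϵ_θ/(ϵ_θ + ϵ_P) is 0, so v ≡ 1 and the plan diag(u)K never sees the current belief θ. A quick sketch (toy problem ours) confirms that two very different beliefs produce identical plans:

```python
import numpy as np

def uot_plan(C, eta, theta, eps_P, eps_eta, eps_theta, n_iter=500):
    # Unbalanced Sinkhorn scaling (Algorithm 2), fixed iteration budget.
    K = np.exp(-C / eps_P)
    u, v = np.ones(len(eta)), np.ones(len(theta))
    for _ in range(n_iter):
        u = (eta / (K @ v)) ** (eps_eta / (eps_eta + eps_P))
        v = (theta / (K.T @ u)) ** (eps_theta / (eps_theta + eps_P))
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(3)
C = rng.random((3, 3))
eta = np.array([0.2, 0.3, 0.5])
theta_a = np.array([0.9, 0.05, 0.05])   # two very different current beliefs
theta_b = np.array([0.1, 0.1, 0.8])

P_a = uot_plan(C, eta, theta_a, 1.0, 5.0, 0.0)
P_b = uot_plan(C, eta, theta_b, 1.0, 5.0, 0.0)
# Identical plans: the eps_theta = 0 update ignores theta entirely, so
# successive posteriors are independent draws determined by the datum alone.
```

This is exactly why the posterior sequence fails to accumulate evidence in this regime.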

5. RELATED WORK

Prior work defines and outlines basic properties of Unbalanced Optimal Transport (Liero et al., 2018; Chizat et al., 2018; Pham et al., 2020) . Bayesian approaches are prominent in machine learning (Murphy, 2012) and beyond (Jaynes, 2003; Gelman et al., 1995) . There is also research on cooperative learning (Wang et al., 2019; 2020b; a) see also (Liu et al., 2021; Yuan et al., 2021; Zhu, 2015; Liu et al., 2017; Shafto and Goodman, 2008; Shafto et al., 2014; Frank and Goodman, 2012; Goodman and Frank, 2016; Fisac et al., 2017; Ho et al., 2018; Laskey et al., 2017) . Discriminative learning is the reciprocal problem in which one sees data and asks which hypothesis best explains it (Ng and Jordan, 2001; Mandler, 1980) . We are unaware of any work that attempts to unify and analyze the general problem of learning in which each of these are instances.

A ADDITIONAL MATERIALS

Algorithm 3 Unbalanced Sinkhorn Scaling
  input: C, θ, η, ϵ = (ϵ_P, ϵ_η, ϵ_θ), N, stopping condition ω
  initialize: K = exp(−C/ϵ_P), v^(0) = 1_m, k ← 1
  while k < N and not ω do
    u^(k) ← (η / (K v^(k−1)))^{ϵ_η/(ϵ_η+ϵ_P)},  v^(k) ← (θ / (K^T u^(k)))^{ϵ_θ/(ϵ_θ+ϵ_P)}
    k ← k + 1
  end while
  output: M = diag(u) K diag(v)

Cooperative Communication. Cooperative communication formalizes a single problem comprised of interactions between two processes: teaching and learning. The teacher and learner have beliefs about hypotheses, represented as probability distributions. The process of teaching is to select data that move the learner's beliefs from some initial state to a final desired state. The process of learning is, given the data selected by the teacher, to infer the beliefs of the teacher. The teacher's selection and the learner's inference incur costs, and the agents minimize these costs to achieve their goals. Communication is successful when the learner's belief, given the teacher's data, is moved to the target distribution. Formally, denote the common ground between agents: the shared priors on H and D, P(h) and P(d), and the shared initial matrix M over D and H of size |D| × |H|. In general, up to normalization, M is simply a non-negative matrix which specifies the consistency between data and hypotheses. In cooperative communication, a learner's goal is to minimize the cost of transforming the observed data distribution P(D) into the shared prior over hypotheses P(H). The learner's cost matrix C^L = (C^L_ij)_{|D|×|H|} is defined by C^L_ij = −log M_ij. A learning plan is a joint distribution L = (L_ij), where L_ij = P_L(d_i, h_j) represents the probability of the learner inferring h_j given d_i. It is proved in Wang et al. (2019) that:

Proposition 17. The optimal cooperative communication plan L is the EOT plan with cost C^L and marginals η = P(d) and θ = P(h).

B PROOFS

Proposition 1. The UOT problem with cost matrix C, marginals θ, η and parameters ϵ = (ϵ_P, ϵ_η, ϵ_θ) generates the same UOT plan as the UOT problem with tC, θ, η, tϵ = (tϵ_P, tϵ_η, tϵ_θ), for any t ∈ (0, ∞).

Proof. Consider that the UOT problem solution is

P_ϵ(C, η, θ) = argmin_{P ∈ (R_≥0)^{n×m}} {⟨C, P⟩ − ϵ_P H(P) + ϵ_η KL(P1 | η) + ϵ_θ KL(P^T 1 | θ)},    (1)

where the objective function is linear in C and ϵ. Hence

P_{tϵ}(tC, η, θ) = argmin_{P ∈ (R_≥0)^{n×m}} {⟨tC, P⟩ − tϵ_P H(P) + tϵ_η KL(P1 | η) + tϵ_θ KL(P^T 1 | θ)}
= argmin_{P ∈ (R_≥0)^{n×m}} t · {⟨C, P⟩ − ϵ_P H(P) + ϵ_η KL(P1 | η) + ϵ_θ KL(P^T 1 | θ)} = P_ϵ(C, η, θ).

Proposition 2. The UOT plan P in Equation (1), as a function of ϵ, is continuous on (0, ∞) × [0, ∞)². Furthermore, P is differentiable with respect to ϵ in the interior.

Proof. For simplicity, in this proof, for a vector v we use both v_i and v(i) to denote a component of v. By definition, the UOT plan P minimizes the objective function Ω(P; ϵ) = ⟨C, P⟩ − ϵ_P H(P) + ϵ_η KL(P1 | η) + ϵ_θ KL(P^T 1 | θ). Since Ω is strictly convex in P, the minimizer P is unique, so the UOT plan P is the solution to ∇_P Ω = 0. A direct calculation gives

(∇_P Ω)_ij = C_ij + ϵ_P ln P_ij + ϵ_η (ln(Σ_{k=1}^m P_ik) − ln η(i)) + ϵ_θ (ln(Σ_{k=1}^n P_kj) − ln θ(j))

and

(∇²_P Ω)_ijkl = ϵ_P δ_ik δ_jl / P_ij + ϵ_η δ_ik / (Σ_{t=1}^m P_it) + ϵ_θ δ_jl / (Σ_{t=1}^n P_tj).

As we assume P_ij > 0 for all i, j, all the terms above are well-defined. Moreover, ∇_P Ω is C¹ in η, θ and ϵ; therefore, after checking the Hessian, P_ϵ(C, η, θ) is continuous not only in ϵ but also in η and θ. By the implicit function theorem, the results of the proposition follow once the Hessian above is shown to be invertible for ϵ_P > 0. Equivalently, it suffices to show that det H ≠ 0, where the matrix H is the flattened ∇²_P Ω under the index map (i, j, k, l) → (im + j, km + l).

Invertibility of H. Let r be the vector of reciprocals of row sums of P, i.e., r_i = 1/Σ_j P_ij, and similarly let c be the vector of reciprocals of column sums of P, i.e., c_j = 1/Σ_i P_ij. Then

(∇²_P Ω)_ijkl = ϵ_P δ_ik δ_jl / P_ij + ϵ_η δ_ik r_i + ϵ_θ δ_jl c_j.

Let ϕ be the map (i, j) → im + j; then ϕ induces a reshaping of P into a vector of size mn, denoted by P^ϕ. When there is no ambiguity, we may omit the superscript ϕ. Further define p^ϕ as the vector of dimension mn with p^ϕ_k = ϵ_P / P^ϕ_k.
By definition, H^ϕ = diag(p^ϕ) + ϵ_η 1_m ⊗ diag(r) + ϵ_θ diag(c) ⊗ 1_n, where 1_k is the k × k matrix of ones and A ⊗ B is the Kronecker product (tensor product of matrices). Decompose H = D + G, where D = diag(p^ϕ) and G = ϵ_η 1_m ⊗ diag(r) + ϵ_θ diag(c) ⊗ 1_n. From now on, we use P-row, P-column for i, j style indices, and G-row, G-column (or simply row/column) for indices of G, i.e., indices in the range [1, mn]. D is diagonal, and det G = 0. Furthermore, (*) any row or column of G with index k can be represented by an entry position (i, j) of P via the inverse of ϕ, and any rows of indices k_1, k_2, k_3, k_4 corresponding to (i_1, j_1), (i_1, j_2), (i_2, j_1), (i_2, j_2) (i.e., determined as intersections of two P-rows and two P-columns) are linearly dependent:

G_(k_1,_) + G_(k_4,_) − G_(k_2,_) − G_(k_3,_) = 0.

Next we show that f(I) is nonnegative for all I; then, since p_k > 0 for all k, we can conclude that det H > 0. Since I ⊆ {1, 2, ..., mn}, ϕ^{−1}(I) ⊆ {1, 2, ..., n} × {1, 2, ..., m}, and ϕ is a bijection, we do not distinguish I from ϕ^{−1}(I), in order to keep the statements neater.

1. [Operation-(*) on I]:

We want to identify operations on I producing a subset J such that f(I) = f(J). By the properties of the determinant, (*) induces one such operation: when I contains 4 integer pairs which form the vertices of a rectangle, f(I) = 0. Moreover, for indices k_1, k_2, k_3, k_4 as in (*), we can generate row G_(k_4,_) as G_(k_4,_) = G_(k_2,_) + G_(k_3,_) − G_(k_1,_); thus if {k_1, k_2, k_3} ⊆ I, we can build G_(k_4,_) on any G_(k_i,_), and the determinant satisfies det G^row_(I,I) = ± det G_(I,I) (positive for k_2 and k_3, negative for k_1). Similarly, applying the same operation on columns, det G^col_(I,I) = ± det G_(I,I), and when doing both, det G^{col∘row}_(I,I) = det G_(I,I). Therefore, if k_1, k_2, k_3 ∈ I and J = {k_4} ∪ I\{k_i} for any i = 1, 2, 3, then f(I) = f(J). Such an operation changing I into J is denoted operation-*. In short, an operation-* moves an end of a small "L-shaped" set of 3 pairs along a P-row or a P-column, producing another L-shaped set of 3 pairs.

2.

[Regularized form of I, and decomposition of a nondegenerate regularized form I^♯ into L-shaped subsets]: Once I, or any J equivalent to I via operations-*, contains 4 pairs satisfying condition (*), f(I) = 0; we then call I degenerate. In decomposing I, whenever we find it degenerate, we stop, since f(I) is known. Before stopping, we decompose I as a set of pairs inductively in the following way. Starting with any (i, j) ∈ I, we look for pairs of the form (i, l) and (k, j) in I, adding them to the subset A_(i,j) containing (i, j). Then we check degeneracy by looking for whether I contains a point (k, l) with (i, l), (k, j) ∈ A_(i,j); whenever I is degenerate, we stop, since f(I) = 0. Next we enlarge A_(i,j) by changing the set I to a regularized form using operations-*. For each (k, l) with (i, l) ∈ A_(i,j), the pair (k, j) can be constructed from (k, l) via an operation-* with (i, j) and (i, l). Thus we modify I into J = {(k, j)} ∪ I\{(k, l)} with f(I) = f(J), adding (k, j) to the set A_(i,j). A similar process applies to those (k, l) ∈ I with (k, j) ∈ A_(i,j). After regularizing I and enlarging A_(i,j) to its maximum about (i, j), we obtain a regularized form J of I with f(I) = f(J), together with a component A_(i,j) of L-shape. The set J\A_(i,j) has no elements of the form (k, l) with (i, l) ∈ A_(i,j) or (k, j) ∈ A_(i,j), as such elements have already been moved into A_(i,j) by operations-*. Therefore, J\A_(i,j) is supported on the rectangular region obtained by deleting all P-rows (k, _) and P-columns (_, l) where k, l occur in A_(i,j). Repeating the L-shaped component construction above on J\A_(i,j), we can transform I into a regularized form (not unique or standard) I^♯ with a decomposition I^♯ = ∪_t A_(i_t,j_t) into L-shaped components, which do not intersect each other. The name "regularized form" refers to the transformed set with an L-shaped decomposition, and since only operations-* are applied, f(I) = f(I^♯).

3.
[Properties between the L-shaped subsets]: For each I for which we did not conclude f(I) = 0 in the last step, we obtain I^♯ and a decomposition I^♯ = ∪_{t∈T} A_t into L-shaped subsets. The construction of the components A_t yields the following property: two distinct components A_t and A_s share no P-rows or P-columns; in plain words, A_t occupies certain P-rows and P-columns which are disjoint from those of A_s. For (i, j) and (k, l) with i ≠ k and j ≠ l, G_{im+j, km+l} = 0, by the formula G_{im+j, km+l} = ϵ_η r_i δ_ik + ϵ_θ c_j δ_jl. Therefore, the decomposition I^♯ = ∪_{t∈T} A_t induces a decomposition of the matrix G_(I^♯,I^♯) into the blockwise diagonal matrix

diag(G_{A_1,A_1}, G_{A_2,A_2}, ..., G_{A_t,A_t}),    (4)

so for a decomposition I^♯ = ∪_{t∈T} A_t we have f(I^♯) = Π_{t∈T} f(A_t).

4. [f(A) for an L-shaped component]: The last part is to show f(A) > 0 for all L-shaped components. Recall that G_{im+j, km+l} = ϵ_η r_i δ_ik + ϵ_θ c_j δ_jl, so for A an L-shaped component with s P-rows and t P-columns, G_(A,A) in general takes the form (5): absorbing the factors ϵ_η, ϵ_θ into r and c, the diagonal entries are of the form r_i + c_j, the upper-left t × t block is E = diag(c_1, ..., c_t) + r_1 1 1^T, and the lower-right block is D = diag(r_2, ..., r_s) + c_t 1 1^T. Recall the block determinant formula

det [E B; C D] = det(E) det(D − C E^{−1} B)

and the matrix determinant lemma

det(diag(c) + r 1 1^T) = (1 + r 1^T diag(c)^{−1} 1) det(diag(c)) = Π_i c_i · (1 + Σ_i (r/c_i)).

If s = 1 or t = 1, the determinant of G_(A,A) can be calculated directly by the matrix determinant lemma above. If s > 1 and t > 1, we cut Eq. (5) into 4 blocks [E B; C D], where E contains the upper-left t × t part, B is zero except for its last row, C is zero except for its last column, and D is a matrix of a form similar to E. From the structure of B and C stated above, it follows that C E^{−1} B = c_t² (E^{−1})_{t,t} 1 1^T, which is an s × s matrix.
The entry (E^{−1})_{t,t} = det E_(1:t−1,1:t−1) / det E, where E_(1:t−1,1:t−1) is the matrix E without its last row and last column; moreover,

(E^{−1})_{t,t} = [Π_{i=1}^{t−1} c_i · (1 + Σ_{i=1}^{t−1} (r_1/c_i))] / [Π_{i=1}^{t} c_i · (1 + Σ_{i=1}^{t} (r_1/c_i))] = (1 + Σ_{i=1}^{t−1} (r_1/c_i)) / [c_t (1 + Σ_{i=1}^{t} (r_1/c_i))] < 1/c_t.

Therefore C E^{−1} B = λ 1 1^T with λ < c_t, and D − C E^{−1} B = diag(r_{2:s}) + (c_t − λ) 1 1^T, whose determinant is positive by the matrix determinant lemma. As a consequence, det G_(A,A) > 0 for each L-shaped component A. Combining the discussions in [1-4], we have det H = det(D + G) > 0. The implicit function theorem then implies the differentiability of P_ϵ with respect to ϵ.

Proposition 3. For any finite s_P, s_η, s_θ ≥ 0, the limit of P_ϵ exists as ϵ approaches (∞, s_η, s_θ); in fact, lim_{ϵ→(∞,s_η,s_θ)} P_ϵ,ij = 1 for all i, j (Limit 1). Moreover, P_ϵ converges to the solution of min ⟨C, P⟩ − s_P H(P) + s_θ KL(P^T 1 | θ) under the constraint P1 = η, as ϵ → (s_P, ∞, s_θ) (Limit 2). Similarly, P_ϵ converges to the solution of min ⟨C, P⟩ − s_P H(P) + s_η KL(P1 | η) under the constraint P^T 1 = θ, as ϵ → (s_P, s_η, ∞) (Limit 3). In the case ϵ → (s_P, ∞, ∞), the matrix P_ϵ converges to the EOT solution (Limit 4): min ⟨C, P⟩ − s_P H(P) under the constraints P^T 1 = θ and P1 = η. When ϵ → (∞, ∞, s_θ), (∞, s_η, ∞) or (∞, ∞, ∞), the limit does not exist, but directional limits can be calculated.

Proof.

Recall that H(P) = −Σ_{ij} (P_ij ln P_ij − P_ij), so (∇_P H)_ij = −ln P_ij, and H(P) is strictly concave; therefore H has a unique maximum mn at P_ij = 1 for all i, j, i.e., at the all-ones matrix, denoted 1. Similarly, KL(a | b) = Σ_i (a_i (ln a_i − ln b_i) − a_i + b_i), (∇_a KL(a | b))_i = ln a_i − ln b_i, and KL is strictly convex; therefore KL has a minimum of 0 at a_i = b_i for all i.

Limit 1. Shown by contradiction. When ϵ → (∞, s_η, s_θ), suppose the limit lim_{ϵ→(∞,s_η,s_θ)} P_ϵ,ij for some (i, j) does not exist, or is not 1. Then there is an e > 0 such that, for any δ > 0 and N > 0, there exists a parameter ϵ¹ = (ϵ_P, ϵ_η, ϵ_θ) with ϵ_P > N, |ϵ_η − s_η| < δ and |ϵ_θ − s_θ| < δ, satisfying |P_ϵ,ij − 1| > e. However, for any 0 < e < 1/2, let δ = 1 and E = (1+e) ln(1+e) − (1+e) + 1 > 0. We have min Ω(P; ϵ) ≤ Ω(1; ϵ) < G for some G > 0, where (1)_ij = 1 for all (i, j), uniformly over ϵ ∈ {(ϵ_P, ϵ_η, ϵ_θ) : s_η/2 < ϵ_η < 3s_η/2, s_θ/2 < ϵ_θ < 3s_θ/2}. So there is an N > 0 such that N E > G + max_ij C_ij + mn + L, where L = −inf{ϵ_η KL(P1 | η) + ϵ_θ KL(P^T 1 | θ)}, meaning that any P with |P_ij − 1| > e for some (i, j) does not minimize Ω. This contradiction shows that lim_{ϵ→(∞,s_η,s_θ)} P_ϵ,ij = 1 for all i, j.

Limits 2 & 3: The situations ϵ_θ → ∞ and ϵ_η → ∞ are similar, so we only prove the ϵ_θ → ∞ case. Let P̂ denote the solution to the constrained optimization problem in Eq. (7). We first show that lim_{ϵ→(s_P,s_η,∞)} Σ_{k=1}^n P_ϵ,kj = θ_j. This is similar to Limit 1: suppose the limit either does not exist or is not θ_j; then there exists an e > 0 such that for any N > 0 and δ > 0, there exist ϵ_θ > N, |ϵ_η − s_η| < δ and |ϵ_P − s_P| < δ such that |Σ_{k=1}^n P_ϵ,kj − θ_j| > e for some j. Thus KL((P_ϵ)^T 1 | θ) > E for some E > 0. Since ⟨C, P⟩ ≥ 0, H(P) ≥ −mn and KL(P1 | η) ≥ 0 are lower bounded, we can take N sufficiently large that such P_ϵ satisfy Ω(P_ϵ; ϵ) > Ω(P̂; ϵ), making P_ϵ fail to optimize Ω(•; ϵ), a contradiction. Thus lim_{ϵ→(s_P,s_η,∞)} Σ_{k=1}^n P_ϵ,kj = θ_j.
For each ϵ = (ϵ_P, ϵ_η, ϵ_θ), let θ_ϵ denote (P_ϵ)^T 1; then for any ϵ, the solution P_ϵ is also the solution to min_P ⟨C, P⟩ − ϵ_P H(P) + ϵ_η KL(P1 | η) under the constraint P^T 1 = θ_ϵ. Denote

Φ(P, ϵ_P, ϵ_η) := ⟨C, P⟩ − ϵ_P H(P) + ϵ_η KL(P1 | η).

When ϵ_P ∈ (0, ∞), the new objective function Φ(P, ϵ_P, ϵ_η) is continuous in P, ϵ_P and ϵ_η, and each minimization problem has a unique solution since the objective is strictly convex. Therefore the limit lim_{ϵ→(s_P,s_η,∞)} P_ϵ = P̂. We show this by contradiction. Suppose otherwise: there exists some ξ > 0 such that ||P_ϵ − P̂||₂ > ξ for ϵ arbitrarily close to (s_P, s_η, ∞). Let α := inf_{P^T 1 = θ, ||P − P̂||₂ > ξ} Φ(P, s_P, s_η) − Φ(P̂, s_P, s_η); then α > 0, since the minimizer P̂ is unique and the objective is strictly convex. The set {P ∈ (R_≥0)^{n×m} : P^T 1 = θ_ϵ} is compact, since it is closed and bounded, so there exist bounds b = (b_1, b_2, b_3) for ϵ = (ϵ_P, ϵ_η, ϵ_θ) such that within the bound |ϵ_P − s_P| < b_1, |ϵ_η − s_η| < b_2 and ϵ_θ > b_3, we have max |Φ(P, s_P, s_η) − Φ(P^♯, ϵ_P, ϵ_η)| < α/3 for P with P^T 1 = θ and P^♯ its Euclidean projection onto {P^T 1 = θ_ϵ}, and max |Φ(P, ϵ_P, ϵ_η) − Φ(P^♭, s_P, s_η)| < α/3 for P with P^T 1 = θ_ϵ and P^♭ its Euclidean projection onto {P^T 1 = θ}. Let ϵ be a parameter within the bound b where P = argmin_{P^T 1 = θ_ϵ} Φ(P, ϵ_P, ϵ_η) is ξ-far from P̂. Then

Φ(P, ϵ_P, ϵ_η) > Φ(P^♭, s_P, s_η) − α/3 > Φ(P̂, s_P, s_η) + 2α/3 > Φ(P̂^♯, ϵ_P, ϵ_η) + α/3 > Φ(P̂^♯, ϵ_P, ϵ_η),

a contradiction to the assumption that P is the argmin.

Limit 4: Similar to the previous two limits, we have lim_{ϵ→(s_P,∞,∞)} Σ_{k=1}^n P_ϵ,kj = θ_j and lim_{ϵ→(s_P,∞,∞)} Σ_{k=1}^m P_ϵ,ik = η_i. The problem then becomes the EOT problem, which has a unique solution.

Boundaries at ϵ_η = 0 or ϵ_θ = 0: It is simple to check continuity as ϵ_η → 0 or ϵ_θ → 0. From Prop. 2, continuity and differentiability hold at ϵ_η = 0 or ϵ_θ = 0 whenever ϵ_P > 0.
Nonexistence of the limits when ϵ_P, ϵ_η → ∞, and directional limits: Let ϵ¹, ϵ², ... be a sequence with ϵ^i = (ϵ^i_P, ϵ^i_η, ϵ^i_θ) satisfying lim ϵ^i_P = lim ϵ^i_η = ∞ and lim(ϵ^i_η / ϵ^i_P) = t. Then the limit P of P_ϵ satisfies ln P_ij = t(ln η_i − ln m)/(t + 1), since the limit minimizes the objective function −H(P) + t KL(P1 | η). The reason is that, in general, H(P) cannot attain its maximum and KL(P1 | η) vanish at the same P; thus the minimum of the objective grows without bound, and the finite terms ⟨C, P⟩ and ϵ_θ KL(P^T 1 | θ) have vanishing influence on the minimizer as ϵ_P, ϵ_η increase to infinity. A direct consequence of the above discussion is that, as t changes, the limit P of the corresponding sequence changes, which shows that the limit of P_ϵ as ϵ → (∞, ∞, s_θ) fails to exist. A similar situation occurs as ϵ → (∞, s_η, ∞).

Nonexistence of the limits when ϵ_P, ϵ_η, ϵ_θ → ∞, and directional limits: Similar to the discussion above, let the sequence ϵ¹, ϵ², ... with ϵ^i = (ϵ^i_P, ϵ^i_η, ϵ^i_θ) satisfy lim_{i→∞} ϵ^i = (∞, ∞, ∞), and further let lim(ϵ^i_η/ϵ^i_P) = u and lim(ϵ^i_θ/ϵ^i_P) = w. Then P_{ϵ^i} converges to the solution of min −H(P) + u KL(P1 | η) + w KL(P^T 1 | θ), which can be viewed as another UOT problem with cost identically 0.

Corollary 4. Consider a UOT problem with cost C = −log P(d|h) and marginals θ = P(h), η ∈ P(D). The optimal UOT plan P_(1,ϵ_η,ϵ_θ) converges to the posterior P(h|d) as ϵ_η → 0 and ϵ_θ → ∞. Bayesian inference is a special case of GBT with ϵ = (1, 0, ∞).

Proof. As a direct application of Limit 3 of Proposition 3, we only need to show that the optimal plan P_(1,0,∞) is proportional to the posterior P(h|d). Solving Equation (12) gives P_ij = P(d_i|h_j) e^{−λ_j}, and the constraint Σ_i P_ij = P(h_j) together with Σ_i P(d_i|h_j) = 1 yields e^{−λ_j} = P(h_j). Hence P_ij = P(d_i|h_j) P(h_j) ∝ P(h_j|d_i), which completes the proof.

Corollary 5. Consider a UOT problem with θ ∈ P(H), η = P(d).
The optimal UOT plan P_(ϵ_P,∞,0) converges to η ⊗ 1 as ϵ_P → ∞. Frequentist inference is a special case of GBT with ϵ = (∞, ∞, 0).

Proof. As a direct application of Proposition 3, we only need to show that P_(∞,∞,0) = η ⊗ 1. Notice that Eq. (1) reduces to

P_(∞,∞,0) = argmax_{P ∈ (R_≥0)^{n×m}} H(P), under the constraint P1 = η,    (13)

and hence P_(∞,∞,0) = η ⊗ 1.

Corollary 6. Let cost C = −log M, and marginals θ = P(h), η = P(d). The optimal UOT plan P_(1,ϵ_η,ϵ_θ) converges to the optimal plan L as ϵ_η → ∞ and ϵ_θ → ∞. Cooperative Inference is a special case of GBT with ϵ = (1, ∞, ∞), which is exactly entropic Optimal Transport (Cuturi, 2013).

Proof. According to Proposition 17, L = P_(1,∞,∞), and the convergence result is a direct application of Limit 4 of Proposition 3.

Corollary 7. Consider a UOT problem with cost C = −log P(d, h), m = n, and uniform marginals θ = η. The optimal UOT plan P_(ϵ_P,ϵ_η,ϵ_θ) approaches a diagonal matrix as ϵ_η, ϵ_θ → ∞ and ϵ_P → 0. In particular, the discriminative learner is a special case of GBT with ϵ = (0, ∞, ∞), which is exactly classical Optimal Transport (Villani, 2008).

Proof. Limit 4 of Proposition 3 implies the convergence P_(ϵ_P,ϵ_η,ϵ_θ) → P_(0,∞,∞) as ϵ_η, ϵ_θ → ∞ and ϵ_P → 0. When m = n, the fact that P_(0,∞,∞) is a permutation matrix is the result of Wang et al. (2020b), Proposition 8.

Proposition 8. Consider GBT with ϵ_θ = ∞, cost C and current belief θ. The learner updates θ with the UOT plan in the same way as applying Bayes rule with likelihood derived from P_ϵ(C, η, θ) and prior θ.

Proof. From Algorithm 1, for a chosen data point d_i, GBT takes the vector normalization of a row of P_ϵ, i.e., θ′ = P_ϵ(i,_) / (Σ_j P_ϵ,ij). On the other hand, when we apply Bayes rule to P_ϵ, the prior is θ = P(h) and the likelihood P(d|h) is the column normalization of P_ϵ, satisfying P(d_i|h_j) = P_ϵ,ij / (Σ_i P_ϵ,ij) = P_ϵ,ij / θ_j. The last equality holds because θ(j) = Σ_i P_ϵ,ij when ϵ_θ = ∞.
So the posterior P(h|d_i) is the vector normalization of P(d_i|h)P(h), and since P(d_i|h_j)P(h_j) = (P_ϵ,ij / θ_j) · θ_j = P_ϵ,ij, we have P(h_j|d_i) = θ′(h_j).

We now introduce some notation used in the following proofs.

Notations. Denote the set of all possible beliefs by ∆ = P(H). The distribution of Θ_k is denoted by µ_k. We only consider the case where no two hypotheses in H are the same; hence we assume that no two columns of exp(−ϵ_P C) differ by a multiplicative scalar, i.e., no two columns of C differ by an additive scalar.

Lemma 10. For ϵ = (ϵ_P, ∞, ∞) with ϵ_P ∈ (0, ∞), given cost C, initial belief θ_0 ∈ P(H) and fixed teaching and learning distribution η_k = η ∈ P(D) for all k, the belief random variables (Θ_k)_{k∈N} have the same expectation on each h: E_{Θ_k}[θ(h)] = θ_0(h).

Proof. We start the proof by showing E_{Θ_k}[θ(h)] = E_{Θ_{k−1}}[θ(h)] for k ≥ 1. Notice that, given cost C and data marginal η, an observed data point d ∈ D and UOT planning uniquely determine a map from a learner's belief θ_{k−1} to the posterior belief θ_k; denote this map by T_d : θ_{k−1} → θ_k. Let µ_{k−1} be the distribution of Θ_{k−1} over P(H), with support S_{k−1}. Then the following holds:

E_{Θ_k}[θ(h_j)] = Σ_{θ∈S_{k−1}} µ_{k−1}(θ) Σ_{d_i∈D} η_i T_{d_i}(θ)(h_j)
= Σ_{θ∈S_{k−1}} µ_{k−1}(θ) Σ_{d_i∈D} η_i · (M_k(i, j)/η_i)
= Σ_{θ∈S_{k−1}} µ_{k−1}(θ) Σ_{d_i∈D} M_k(i, j)
= Σ_{θ∈S_{k−1}} µ_{k−1}(θ) θ(h_j) = E_{Θ_{k−1}}[θ(h_j)].

Hence E_{Θ_k}[θ(h)] = E_{Θ_{k−1}}[θ(h)] = ··· = E_{Θ_0}[θ(h)] = θ_0(h).

Theorem 11. Consider a learning problem with initial belief θ_0 ∈ P(H) and the true hypothesis h* defined by η ∈ P(D). If the learner's data distribution satisfies η_k = η, then for ϵ_η = ϵ_θ = ∞ and ϵ_P ∈ (0, ∞), the belief random variables (Θ_k)_{k∈N} converge in probability to the random variable Y = Σ_{h∈H} θ_0(h) δ_h, which is supported on {δ_h}_{h∈H} with P(Y = δ_h) = θ_0(h).

Proof. Step 1: First, we show the following claim, inspired by the proof of Proposition 5.1 in Wang et al.
(2020a). Claim: lim_{k→∞} µ_k(∆_ϵ) = 0 for any ϵ > 0, where ∆_ϵ := {θ ∈ ∆ : θ(h) ≤ 1 − ϵ, ∀h ∈ H}.

Assume the claim does not hold. Then there exist α > 0 and a subsequence (µ_{k_i})_{i∈N} such that µ_{k_i}(∆_ϵ) > α for all i. Let u be the center of ∆, and define L(µ) := E_µ f(θ), where f(θ) = ∥θ − u∥²₂ (f may also be chosen as the entropy H(θ)). Then L(µ_{k+1}) = E_{µ_k}(E_{d∼η} f(T_d(θ))). Notice that f is strictly convex, so by Jensen's inequality,

E_{d∼η} f(T_d(θ)) ≥(a) f(E_{d∼η} T_d(θ)) =(b) f(θ).

Here (b) holds because

E_{d∼η} T_d(θ) =(c) Σ_{d_i∈D} η_i · (M_k(i,_)/η_i) = Σ_{d_i∈D} M_k(i,_) =(d) θ.    (14)

(The equality case of (a) is resolved as follows. In case (1), if θ_0 ≠ δ_h, then Θ_k ≡ δ_h contradicts Lemma 10; otherwise Y = δ_h and the result holds. In case (2), according to Wang et al. (2019), M_k is cross-ratio equivalent to exp(−ϵ_P C), hence exp(−ϵ_P C) has two columns differing by a multiplicative scalar, contradicting our assumption.)

Thus for any θ ∈ ∆_ϵ, E_{d∼η} f(T_d(θ)) > f(θ), and therefore L(µ_{k+1}) ≥ L(µ_k) for any k. Moreover, since ∆_ϵ is compact, there is a lower bound β > 0 such that E_{d∼η} f(T_d(θ)) − f(θ) > β for all θ ∈ ∆_ϵ. Therefore:

L(µ_{k_i+1}) = E_{θ_{k_i}∈∆_ϵ}(E_{d∼η} f(T_d(θ))) + E_{θ_{k_i}∈∆\∆_ϵ}(E_{d∼η} f(T_d(θ))) > E_{θ_{k_i}∈∆_ϵ}(f(θ)) + E_{θ_{k_i}∈∆\∆_ϵ}(f(θ)) + α·β = L(µ_{k_i}) + α·β.

Thus L(µ_{k_i+s}) > L(µ_{k_i}) + s·α·β → ∞ as s → ∞. On the other hand, by definition, f(θ) is bounded above by the diameter of ∆ in the ℓ₂ norm, so L(µ) is also bounded above: a contradiction. Therefore the Claim holds.

Step 2. We show lim_{k→∞} P(Θ_k ∈ ∆^h_{1−ϵ}) = lim_{k→∞} µ_k(∆^h_{1−ϵ}) = θ_0(h) for all h ∈ H, where ∆^h_{1−ϵ} := {θ ∈ ∆ : θ(h) > 1 − ϵ}. For a fixed h ∈ H, we have:

θ_0(h) =(a) E_{Θ_k}(θ(h)) =(b) E_{θ_k∈∆^h_{1−ϵ}}(θ(h)) + E_{θ_k∈∆^u_{1−ϵ}}(θ(h)) + E_{θ_k∈∆_ϵ}(θ(h)) ≤(c) µ_k(∆^h_{1−ϵ})·1 + µ_k(∆^u_{1−ϵ})·ϵ + µ_k(∆_ϵ)·1 = µ_k(∆^h_{1−ϵ}) + ϵ + µ_k(∆_ϵ),

where ∆^u_{1−ϵ} denotes the union of all the other corners of ∆, i.e., ∆^u_{1−ϵ} := ∪_{h′∈H\{h}} ∆^{h′}_{1−ϵ}. Here (a) is a direct application of Lemma 10; (b) holds since ∆ = ∆^h_{1−ϵ} ∪ ∆^u_{1−ϵ} ∪ ∆_ϵ; and (c) holds because in general θ(h) ≤ 1, and θ(h) < ϵ for any θ ∈ ∆^u_{1−ϵ}. Therefore 0 ≤ θ_0(h) − µ_k(∆^h_{1−ϵ}) ≤ ϵ + µ_k(∆_ϵ) → ϵ as k → ∞, for any choice of ϵ. Picking a sequence ϵ → 0, we obtain lim_{k→∞} µ_k(∆^h_{1−ϵ}) = θ_0(h).

Hence, combining the results from Step 1 and Step 2, we have shown that Θ_k converges to Y in probability:

P(|Θ_k − Y| > ϵ) ≤ µ_k(∆_ϵ) + Σ_{h∈H} (θ_0(h) − µ_k(∆^h_{1−ϵ})) → 0 as k → ∞,

which completes the proof.

Corollary 12. Given a fixed data sequence d_i sampled from η, if θ_k converges to δ_{h_j}, then the j-th column of M_k converges to η.

Proof. For any ϵ > 0, there exists N > 0 such that θ_k(h_j) > 1 − ϵ for all k > N. So Σ_{j′≠j} M_k(i, j′) < ϵ for any d_i ∈ D; on the other hand, Σ_{j′} M_k(i, j′) = η_i. This implies η_i − ϵ < M_k(i, j) < η_i, so M_k(i, j) → η_i as ϵ → 0. Therefore the j-th column of M_k converges to η.

Proposition 13. Consider a learning problem with cost C, initial belief θ_0 ∈ P(H), and the true hypothesis h* defined by η ∈ P(D). If the learner updates the estimate η_k with observed data (sampled from η) as stated above, then the belief random variables (Θ_k)_{k∈N} satisfy, for any s > 0, lim_{k→∞} Σ_{h∈H} P(Θ_k(h) > 1 − s) = 1. As a consequence, M_k as the transport plan has a dominant column (for some h_j) with total weight > 1 − s, and |(M_k)_{ij} − η_k(i)| < s. In fact, the proposition holds as long as the sequence η_k, as random variables, converges to η in probability.

Proof. The proof is similar to Step 1 of Theorem 11. The major difference is that data are sampled from η in each step, whereas the learner only has an estimate η_k at round k. Under the current conditions, equality (b) of Eq. (14) must be modified as follows:

E_{d∼η} T_d(θ_k) = Σ_{d_i∈D} η_i · (M_k(i,_)/η^i_k) = Σ_{d_i∈D} M_k(i,_) · (η_i/η^i_k) = θ_k ⊙ v_k,    (15)

where v_k = (η_i/η^i_k)_i is a vector of the size of the data set D and ⊙ denotes the elementwise product. Hence E_{d∼η} f(T_d(θ_k)) ≥ f(θ_k ⊙ v_k) holds for all θ_k ∈ ∆. Since η_k → η as k → ∞, for any α·β > 0 there exists N > 0 such that for k > N, |1 − η_i/η^i_k| < α·β/(2n). Hence |f(θ_k ⊙ v_k) − f(θ_k)| ≤ α·β/2.
Then, corresponding to Eq. (16), for k_i > N we have:

L(µ_{k_i+1}) = E_{θ_{k_i}∈∆_ϵ}(E_{d∼η} f(T_d(θ))) + E_{θ_{k_i}∈∆\∆_ϵ}(E_{d∼η} f(T_d(θ))) > E_{θ_{k_i}∈∆_ϵ}(f(θ_k ⊙ v_k)) + E_{θ_{k_i}∈∆\∆_ϵ}(f(θ_k ⊙ v_k)) + α·β > E_{θ_{k_i}∈∆_ϵ}(f(θ_k)) + E_{θ_{k_i}∈∆\∆_ϵ}(f(θ_k)) − α·β/2 + α·β = L(µ_{k_i}) + α·β/2.

Hence the contradiction with the upper bound of L(µ_{k_i+1}) still holds, which establishes the claim that lim_{k→∞} µ_k(∆_ϵ) = 0. So lim_{k→∞} Σ_{h∈H} P(Θ_k(h) > 1 − s) = 1. The proof of the second part of the proposition follows exactly as in Corollary 12.

Proposition 15. For ϵ = (ϵ_P, ϵ_η, 0) with ϵ_P ∈ (0, ∞), as η_k → η almost surely, the sequence Θ_k of posteriors, as a sequence of random variables, converges in probability to the variable Θ with P(Θ = v_i) = η(i), where v_i = P_(i,_) / Σ_{j=1}^m P_ij and P = P_ϵ(C, η, θ). Therefore, for any s > 0, lim_{k→∞} Σ_{h∈H} P(|Θ_k(h) − 1| < s) = 0 for generic cost C and marginals η, θ (i.e., for all but a closed subset).

Proof. First, ϵ_θ = 0 means that P_ϵ(C, η, θ) is independent of θ. Therefore M_k = P_ϵ(C, η_k, θ) has the limit P_ϵ(C, η, θ), regardless of the concrete posterior θ_k. From the construction of GBT, the posterior Θ_k is determined by P(Θ_k = w^i_k) = η(i), where w^i_k = (M_k)_(i,_) / Σ_{j=1}^m (M_k)_ij. Given the coupling (Θ_k, Θ) defined by setting P(Θ_k = w^i_k, Θ = v_i) = η(i) for each i, we can see that P(|Θ_k − Θ| < s) converges to 1 as M_k converges to P_ϵ(C, η, θ). For generic C, η, θ, the probability that P_ϵ(C, η, θ) has a row with only one nonzero entry is 0.

Remark: As η_k → η almost surely, for any e > 0 there exists N > 0 such that, when k > N, the probability that η_k is e-close to η is 1. Thus in almost all episodes, with generic C, η, θ, when e is small enough, for any ||η′ − η|| < e (using the ℓ∞ norm, same below) the row-normalized (to 1_n) UOT plans satisfy

max_i ||P^ϵ_r(C, η′, θ)_(i,_) − P^ϵ_r(C, η, θ)_(i,_)|| < (1/4) min_{i,j} ||P^ϵ_r(C, η, θ)_(i,_) − P^ϵ_r(C, η, θ)_(j,_)||,

where P^ϵ_r is the row normalization of P^ϵ. Therefore, for such e, we may find an N > 0 such that for any k, k′ > N, P^ϵ_r(C, η_k, θ) ≠ P^ϵ_r(C, η_{k′}, θ). However, for generic η (say, with no entry of η equal to 0), ||θ_k − θ_{k′}|| remains bounded away from 0 when k, k′ > N and d_k ≠ d_{k′}. Thus the posterior sequence of almost every episode fails to converge.
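Before moving to the simulations, the Bayesian corner of the cube (Corollary 4) is easy to check numerically: with ϵ = (1, 0, ∞) and C = −log P(d|h), row-normalizing the UOT plan recovers the Bayes posterior. The sketch below is ours, not the paper's code; it reuses a minimal unbalanced Sinkhorn iteration (standard Gibbs-kernel convention K = exp(−C/ϵ_P)), with a large finite ϵ_θ standing in for ∞ and hypothetical toy numbers for the likelihood and prior.

```python
import numpy as np

def uot_plan(C, eta, theta, eps_P, eps_eta, eps_theta, n_iter=200):
    """Unbalanced Sinkhorn sketch; eps_eta/eps_theta soften the
    row/column marginal constraints toward eta/theta."""
    K = np.exp(-C / eps_P)
    a = eps_eta / (eps_eta + eps_P)
    b = eps_theta / (eps_theta + eps_P)
    u, v = np.ones(C.shape[0]), np.ones(C.shape[1])
    for _ in range(n_iter):
        u = (eta / (K @ v)) ** a
        v = (theta / (K.T @ u)) ** b
    return u[:, None] * K * v[None, :]

# Toy likelihood P(d|h) (hypothetical numbers): rows = data, cols = hypotheses.
likelihood = np.array([[0.7, 0.2, 0.1],
                       [0.2, 0.5, 0.3],
                       [0.1, 0.3, 0.6]])  # each column sums to 1
prior = np.array([0.5, 0.3, 0.2])         # theta = P(h)
eta = np.full(3, 1 / 3)                   # eta is ignored as eps_eta -> 0

# Corollary 4's corner eps = (1, 0, inf), approximated numerically.
plan = uot_plan(-np.log(likelihood), eta, prior,
                eps_P=1.0, eps_eta=0.0, eps_theta=1e9)

# Row normalization of the plan (the GBT belief update for observed d_i)
# versus Bayes rule computed directly.
posterior_from_plan = plan / plan.sum(axis=1, keepdims=True)
joint = likelihood * prior[None, :]
bayes = joint / joint.sum(axis=1, keepdims=True)
print(np.allclose(posterior_from_plan, bayes, atol=1e-6))  # True
```

With ϵ_η = 0 the row update is inert (exponent 0), and the column update scales column j of the likelihood to total mass θ_j, so the plan is P_ij = P(d_i|h_j)P(h_j), exactly as in the proof of Corollary 4.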

C ADDITIONAL SIMULATIONS

Interpolation between learning models can be investigated naturally under GBT. Human learners appear capable of moving between different learning models gradually. Consider an individual at a carnival who is playing a game. At each of 10 trials, a bit of information is provided, but the available reward decreases. The individual has a pool of tickets with which they can bet on the outcome at each trial. The question is how the individual should update their beliefs in order to maximize their rewards. On the first trial, their belief update should follow Bayes rule in order to accurately reflect the evidence. On the last trial, however, they should focus their bets on the most probable outcome to maximize their chances of reward; that is, their beliefs should be optimized for discriminating among the possible outcomes. GBT offers a coherent way of interpolating between these two approaches, providing candidate strategies for the intermediate steps. Such situations are common wherever there is an explicit constraint on the time horizon after which no further evidence can be obtained, and there are incentives to act early rather than wait until evidence has fully accumulated; for example, identifying dangerous situations (tiger or not? poisonous or not?). We now demonstrate how the continuity of GBT (Section 3.1) allows one to interpolate gradually between Bayesian and discriminative learning over steps, rather than switching sharply.

C.1 SIMULATION SETUP

Suppose a learner observes data sampled from a true hypothesis P(d|h*) and must conclude whether h* is one of the hypotheses in H within a fixed number N of observations.
Here we compare a baseline learner, who applies Bayesian inference (ϵ = (1, 0, ∞)) on the first N − 1 observations and switches to discriminative learning (ϵ = (0, ∞, ∞)) on the last observation, against learners who interpolate gradually from Bayesian to discriminative learning along sequences of models on curves in GBT. Two such curves, along with intermediate models, are shown in red and orange in Figure 5. We take a randomly sampled M of shape 4 × 4 as an example.

Simulation details: We perform 40,000 trials in total. For each trial s (i.e., each episode), we uniformly sample X_s ∈ P(H) and let the true hypothesis h* be the convex combination of elements of H with coefficients given by X_s. When teaching in an episode, in each round we sample a hypothesis h ∈ H following X_s, then sample a data point d following the column of M corresponding to h. During inference, we set η_k by counting the frequency of each d ∈ D (starting from 1 to avoid zeros in η_k) and then normalizing, as stated in the (RS) model in Sec. 4.1.
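The interpolating learners can be specified by a schedule of ϵ values along a curve from the Bayesian vertex (1, 0, ∞) to the discriminative vertex (0, ∞, ∞). The sketch below shows one such schedule; it is our own illustrative choice, not the exact paths of Figure 5, and ∞ is capped at a large finite value (`BIG`).

```python
BIG = 1e6  # finite stand-in for the infinity coordinates of the cube

def eps_schedule(k, N):
    """epsilon = (eps_P, eps_eta, eps_theta) for round k of N rounds,
    moving from the Bayesian vertex (1, 0, inf) at k = 0 to the
    discriminative vertex (0, inf, inf) at k = N - 1."""
    t = k / (N - 1)            # 0 on the first round, 1 on the last
    eps_P = 1.0 - t            # 1 -> 0
    eps_eta = BIG * t          # 0 -> "inf"
    eps_theta = BIG            # column marginal held as a hard constraint
    return eps_P, eps_eta, eps_theta

schedule = [eps_schedule(k, 10) for k in range(10)]
print(schedule[0])   # (1.0, 0.0, 1000000.0): Bayesian update
print(schedule[-1])  # (0.0, 1000000.0, 1000000.0): discriminative update
```

In practice, before passing the last round's parameters to an entropic solver, ϵ_P = 0 would be floored at a small positive value, since the Gibbs kernel exp(−C/ϵ_P) is undefined at ϵ_P = 0.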

C.2 RESULTS

Following the paths shown in Fig. 5, for the baseline (blue, left), path 1 (orange, middle), and path 2 (red, right), the distributions of the maximal component of each posterior at round 10 are shown as histograms with 30 bins, and the entropies of these posteriors are plotted in the lower three figures. In the upper figures, compared to the baseline (blue), weight concentrates more on the right bars for the gradual interpolations (orange and red); thus learning tends to be more conclusive along these paths. Here conclusiveness means the ability to reach a conclusion (one component of the posterior eventually becoming dominant). The entropy distributions shown in the lower figures illustrate the same point: compared to the baseline, the gradual interpolations have lower entropy. Numerical results: entropy of the baseline: mean 0.1888, standard deviation 0.2858; entropy along path 1: mean 0.0097, standard deviation 0.0686; entropy along path 2: mean 0.0571, standard deviation 0.1584. Note that the two paths and their interpolations are chosen for demonstration purposes and are by no means optimal; however, we believe GBT is capable of facilitating exploration of such optimization.



Footnotes. To guarantee that the hypotheses are distinguishable, we assume that C does not contain two columns that differ only by an additive scalar. UOT also generalizes to measures of arbitrary mass, i.e., the total mass of η need not equal that of θ. Data, d_i, are consistent with a hypothesis, h_j, when M_ij > 0. Proofs of all claims are included in the Appendices.

CONCLUSIONS

We have introduced Generalized Belief Transport (GBT), which unifies and parameterizes classic instances of learning, including Bayesian inference, Cooperative Inference, and Discrimination, as Unbalanced Optimal Transport (UOT). We show that each instance is a point in a 3-dimensional space, continuous and differentiable on the interior, defined by the regularization parameters of UOT. In addition to supporting generalized learning, we prove and illustrate asymptotic consistency and estimate rates of convergence, including convergence to hypotheses with zero prior support. In summary, GBT unifies very different modes of learning, yielding a powerful, general framework for modeling learning agents.



Figure 1: The parameter space S of GBT. The parameters ϵ = (ϵ_P, ϵ_η, ϵ_θ) can take the value ∞, rendering the corresponding regularization a strict constraint. The two dashed edges with ϵ_P = ∞ are not generally well-defined, since the limits do not exist. The vertices corresponding to θ ⊗ η, Frequentist (η ⊗ 1) and 1 ⊗ θ are the limits taken along the vertical edges. Given (C, θ, η) as shown in the left corner, each colored map plots a GBT learner's (learners differing in their constraints) estimate of the mapping between hypotheses and data (the UOT plan).

Figure 2: Evidence of general consistency: we plot the percentage of episodes that reach a threshold (0.999) by round number (colors of the bars). Each bar represents a matrix size; for each bar, 100 matrices were randomly sampled and 1000 rounds were simulated per matrix. "exact" means the learner uses η_k = η (PS); "update" means the learner uses statistics of the data observed so far in the episode (RS). "uot" takes ϵ = (1, 40, 40), and "ot" comes with exact and ϵ = (1, ∞, ∞).

Figure 3: Left: behavior of models spanning the line segment between BI and CI. With ϵ_P = 1 and ϵ_θ = ∞, as ϵ_η varies from 0 to ∞, the theory changes from BI to CI. Each bar graphs the Monte-Carlo result of 400,000 teaching sequences; we empirically observe that the coefficients a(h) of the limit, written as Σ_{h∈H} a(h)δ_h, change continuously from the BI limit δ_{h_3} (by Bernstein-von Mises) to the CI limit θ_0(h) (by Theorem 11). Right: the Euclidean distances of each coefficient a(h) to the BI result (blue crosses) and to the CI result (orange dots).

Figure 4: Top: for a learning problem C, behaviors of 9 different learners with ϵ_P = 1, ϵ_θ = ∞, and various ϵ_η (denoted in the figure) on conclusion distributions a(h), shown as bar graphs. The plots below the bars are estimated convergence rates E[ln(θ_k(h)/(1 − θ_k(h)))], averaged over episodes converging to h, one curve per hypothesis.

we denote this property as (*). Structure of det H: let D = diag(p_1, p_2, ..., p_mn); then det H is a polynomial in the p_k's with constant term 0. Each term in det H is of the form f(I) Π_{k∉I} p_k for some subset I ⊆ {1, 2, ..., mn}, with coefficient f(I) = det G^(I,I), where G^(I,I) is the submatrix of G obtained by deleting the rows and columns with indices not in I, i.e., the entries of G^(I,I) are of the form G_ij with i ∈ I and j ∈ I.

arg min_{P∈U(θ)} K(P) := arg min_{P∈U(θ)} {⟨C, P⟩ − H(P)},    (11)

where U(θ) = {P ∈ M(D × H) | P^T 1 = θ}. Let λ ∈ R_+^m and consider the corresponding Lagrangian problem:

L(P, λ) := ⟨C, P⟩ − H(P) + ⟨λ, P^T 1 − θ⟩.

Setting the partial derivatives ∂_{P_ij} L = 0 and ∂_{λ_j} L = 0 results in the following system of equations:

log P_ij − log P(d_i | h_j) + λ_j = 0
Σ_i P_ij − P(h_j) = 0    (12)
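As a sanity check on system (12): solving the first equation gives P_ij = P(d_i|h_j) e^{−λ_j}, and the column constraint pins down λ_j; for column-normalized likelihoods this yields exactly the Bayesian joint P(d_i|h_j) θ_j, so conditioning on a row recovers Bayes' rule. The numeric verification below uses a hypothetical 3×3 likelihood matrix and prior.

```python
import numpy as np

# Hypothetical likelihood matrix L[i, j] = P(d_i | h_j); columns sum to 1.
L = np.array([[0.7, 0.1, 0.2],
              [0.2, 0.3, 0.5],
              [0.1, 0.6, 0.3]])
theta = np.array([0.5, 0.3, 0.2])   # prior over hypotheses, P(h_j) = theta_j

# Closed-form stationary point: P_ij = P(d_i|h_j) * exp(-lambda_j), with
# lambda_j fixed by the column constraint sum_i P_ij = theta_j.
lam = -np.log(theta / L.sum(axis=0))
P = L * np.exp(-lam)[None, :]

# First equation of (12): log P_ij - log P(d_i|h_j) + lambda_j should vanish.
residual = np.log(P) - np.log(L) + lam[None, :]

# Conditioning P on a row recovers the Bayes posterior for that datum.
posterior_d0 = P[0] / P[0].sum()
bayes_d0 = L[0] * theta / (L[0] * theta).sum()
```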

In (15), (c) and (d) hold since M_k has marginals η and θ. Moreover, equality holds in (a) if and only if T_d(θ) = θ for all d ∈ D; thus the rows of M_k are the same up to a scalar. This implies that either (1) only one column of M_k is nonzero, and thus Θ_k ≡ δ_h for some h, or (2) M_k has at least two columns that differ by a scalar.

1/n) UOT plans satisfy

max_i ||P_r^ϵ(C, η′, θ)_(i,·) − P_r^ϵ(C, η, θ)_(i,·)|| < (1/4) min_{i,j} ||P_r^ϵ(C, η, θ)_(i,·) − P_r^ϵ(C, η, θ)_(j,·)||
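The inequality above can be probed numerically. The sketch below (illustrative cost and marginals of our own choosing, with row-normalized balanced plans as a stand-in for P_r^ϵ) checks that a small perturbation of η moves each row of the plan by far less than a quarter of the minimal gap between distinct rows.

```python
import numpy as np

def uot_rows(C, eta, theta, eps=1.0, n_iter=200):
    """Row-normalized entropic OT plan (a stand-in for P_r^eps)."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(eta), np.ones_like(theta)
    for _ in range(n_iter):
        u = eta / (K @ v)
        v = theta / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return P / P.sum(axis=1, keepdims=True)

C = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
theta = np.array([0.5, 0.3, 0.2])
eta = np.ones(3) / 3
eta_pert = np.array([0.335, 0.333, 0.332])   # small perturbation of eta

R = uot_rows(C, eta, theta)
R_pert = uot_rows(C, eta_pert, theta)

# Per-row shift under the perturbation vs. the minimal gap between rows.
row_shift = max(np.linalg.norm(R_pert[i] - R[i]) for i in range(3))
row_gap = min(np.linalg.norm(R[i] - R[j])
              for i in range(3) for j in range(3) if i != j)
```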

|H| = |D| = 4. We set N = 10 and start from the uniform prior θ = (0.25, 0.25, 0.25, 0.25).

Figure 5: Baseline (sharp change) and the two paths we follow in the parameter space of GBT.

Figure 6: Results. Upper: distribution of the maximal component of the posterior. Lower: entropy distribution of the posteriors. Left: baseline. Middle: along path 1. Right: along path 2.

ETHICS STATEMENT

The main contributions of this paper are theoretical, rather than practical, in nature. While understanding learning and inference in more unified and generalized ways may eventually have broad impact, including potential ethical concerns, no such consequences are likely to arise directly from this work.

