GENERALIZED BELIEF TRANSPORT

Abstract

Human learners have the ability to adopt appropriate learning approaches depending on constraints such as the prior on the hypothesis space and the urgency of a decision. However, existing learning models are typically considered individually rather than in relation to one another. To build agents that can move between different modes of learning over time, it is important to understand how learning models are related as points in a broader space of possibilities. We introduce a mathematical framework, Generalized Belief Transport (GBT), that unifies and generalizes prior models, including Bayesian inference, cooperative communication, and classification, as parameterizations of three learning constraints within Unbalanced Optimal Transport (UOT). We visualize the space of learning models encoded by GBT as a cube that includes classic learning models as special points. We derive critical properties of this parameterized space, proving continuity and differentiability, which is the basis for model interpolation, and study the limiting behavior of the parameters, which allows attaching learning models to the boundaries. Moreover, we investigate the long-run behavior of GBT, explore convergence properties of models in GBT mathematically and computationally, and formulate conjectures about general behavior. We conclude with open questions and implications for more unified models of learning.

1. INTRODUCTION

Learning and inference are subject to internal and external constraints. Internal constraints include the availability of relevant prior knowledge, which may be brought to bear on inferences from data. External constraints include the availability of time to accumulate evidence versus the need to make the best decision now. Human learners appear capable of moving between these constraints as necessary. However, standard models of machine learning tend to view constraints as different problems, which impedes the development of a unified view of learning agents. Indeed, internal and external constraints on learning map onto classic dichotomies in machine learning. Internal constraints such as the availability of prior knowledge map onto the frequentist-Bayesian dichotomy, in which the latter uses prior knowledge as a constraint on posterior beliefs while the former does not. Within Bayesian theory, a classic debate pertains to uninformative, or minimally informative, settings of priors (Jeffreys, 1946; Robert et al., 2009). External constraints such as the availability of time to accumulate evidence versus the need to make the best possible decision now inform the use of generative versus discriminative approaches (Ng and Jordan, 2001). Despite the fundamental nature of these debates, and the usefulness of all approaches in the appropriate contexts, we are unaware of prior efforts to unify these perspectives and study the full space of possible models. We introduce Generalized Belief Transport (GBT), based on Unbalanced Optimal Transport (Sec. 2), which parameterizes and interpolates between known reasoning modes (Sec. 3.2), with four major contributions. First, we prove continuity in the parameterization and differentiability on the interior of the parameter space (Sec. 3.1). Second, we analyze the behavior under variations in the parameter space (Sec. 3.3). Third, we consider sequential learning, where learners may or may not track the empirically observed data frequencies.
And finally, we state our theoretical results, simulations, and conjectures about the sequential behaviors for various parameters, for generic costs and priors (Sec. 4.2).

Notation. R_{≥0} denotes the non-negative reals. The vector 1 = (1, . . . , 1). The i-th component of a vector v is v(i). P(A) is the set of probability distributions over A. For a matrix M, M_ij denotes its (i, j)-th entry, M_(i,_) its i-th row, and M_(_,j) its j-th column. Probability is P(·).

Consider a general learning setting: an agent, which we call a learner, updates its belief about the world based on observed data, subject to constraints. There is a finite set D = {d_1, . . . , d_n} of all possible data, which defines the interface between the learner and the world. The world is defined by a true hypothesis h*, whose meaning is captured by a probability mapping P(d | h*) onto observable data. For instance, the world can be the environment, as in classic Bayesian inference (Murphy, 2012), or a teacher, as in cooperative communication (Wang et al., 2020b). A learner is equipped with a set of hypotheses H = {h_1, . . . , h_m}, which may not contain h*; an initial belief on the hypothesis set, denoted θ_0 ∈ P(H); and a non-negative cost matrix C = (C_ij)_{n×m}, where C_ij measures the underlying cost of mapping d_i into h_j.¹ The cost matrix can be derived from other matrices that record the relation between D and H, such as likelihood matrices in classic Bayesian inference or consistency matrices in cooperative communication (see details in Section 3.2). This setting reflects an agent's learning constraints: pre-selected hypotheses, and the relations between them and the communication interface (the data set).

A learner observes data in sequence. At round k, the learner observes a datum d_k that is sampled from D by the world according to P(d | h*). Then the learner updates its belief over H from θ_{k-1} to θ_k through a learning scheme, where θ_{k-1}, θ_k ∈ P(H).
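The objects in this setting are straightforward to instantiate. As one concrete (and purely illustrative) choice, not prescribed by the text beyond the remark that C can be derived from a likelihood matrix, the cost of mapping d_i into h_j can be taken as the negative log-likelihood:

```python
import numpy as np

# Hypothetical likelihood matrix M with M[i, j] = P(d_i | h_j):
# n = 3 possible data, m = 2 hypotheses.
M = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])

# Illustrative assumption: derive the n-by-m cost matrix C from the
# likelihoods as the negative log-likelihood, so that data a hypothesis
# explains well are cheap to map onto it.
C = -np.log(M)

# Uniform initial belief theta_0 over the hypothesis set H.
theta_0 = np.ones(2) / 2
```

Under this choice, C[0, 0] < C[0, 1]: d_1 is cheaper to map onto h_1, which assigns it higher likelihood.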
For instance, in Bayesian inference the learning scheme is defined by Bayes' rule, while in discriminative models it is prescribed by a code book. The learner transforms the observed data into a belief over hypotheses h ∈ H at minimal cost, subject to appropriate constraints, with the goal of learning the exact map P(d | h*). We can naturally cast this learning problem as Unbalanced Optimal Transport.
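As a concrete instance of a learning scheme, Bayes' rule takes θ_{k-1} to θ_k by reweighting the current belief by the likelihood of the observed datum and renormalizing. A minimal sketch (the likelihood matrix and data sequence below are hypothetical illustrations, not values from the paper):

```python
import numpy as np

# Hypothetical likelihoods: rows index data d_i, columns index hypotheses h_j,
# entries are P(d_i | h_j).
likelihood = np.array([[0.7, 0.1],
                       [0.2, 0.3],
                       [0.1, 0.6]])

def bayes_update(belief, d_index, likelihood):
    """One round of the Bayesian learning scheme: theta_k ∝ P(d_k | h) * theta_{k-1}."""
    posterior = likelihood[d_index] * belief
    return posterior / posterior.sum()

theta = np.array([0.5, 0.5])   # initial belief theta_0 over H
for d in [0, 0, 2]:            # an observed data sequence d_1, d_2, d_3
    theta = bayes_update(theta, d, likelihood)
```

After observing d_1 twice, the belief concentrates on h_1, which assigns d_1 higher likelihood; a single observation of d_3 shifts some mass back toward h_2 without overturning it.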

2.1. UNBALANCED OPTIMAL TRANSPORT

Unbalanced optimal transport is a generalization of (entropic) Optimal Transport. Optimal transport infers a coupling that minimizes the cost of transporting between two marginal probability distributions (Monge, 1781; Kantorovich, 2006; Villani, 2008). Entropic Optimal Transport adds a regularization term based on the entropy of the inferred coupling, which has desirable computational consequences (Cuturi, 2013; Peyré and Cuturi, 2019). Unbalanced OT further relaxes the problem by allowing one to only approximately match the marginal probability distributions.

Let η = (η(1), . . . , η(n)) and θ = (θ(1), . . . , θ(m)) be two probability distributions. A joint distribution matrix P = (P_ij)_{n×m} is called a transport plan or coupling between η and θ if P has η and θ as its marginals. Given a cost matrix C = (C_ij)_{n×m} ∈ (R_{≥0})^{n×m}, entropy-regularized optimal transport (EOT) (Cuturi, 2013) solves for the transport plan P_{ϵ_P} that minimizes the entropy-regularized cost of transporting η into θ. Thus, for a parameter ϵ_P > 0:

    P_{ϵ_P} = argmin_{P ∈ U(η,θ)} ⟨C, P⟩ − ϵ_P H(P),

where U(η, θ) is the set of all transport plans between η and θ, ⟨C, P⟩ = Σ_{i,j} C_ij P_ij is the inner product between C and P, and H(P) = −Σ_{i,j} (P_ij log P_ij − P_ij) is the entropy of P.

Unbalanced Optimal Transport (UOT), introduced by Liero et al. (2018), is a generalization of EOT that relaxes the marginal constraints. The degree of relaxation is controlled by two regularization terms. Formally, for non-negative scalar parameters ϵ = (ϵ_P, ϵ_η, ϵ_θ), the UOT plan is

    P_ϵ(C, η, θ) = argmin_{P ∈ (R_{≥0})^{n×m}} { ⟨C, P⟩ − ϵ_P H(P) + ϵ_η KL(P1 | η) + ϵ_θ KL(P^T 1 | θ) }.   (1)

Here KL(a | b) := Σ_i a(i) log(a(i)/b(i)) − a(i) + b(i) is the Kullback-Leibler divergence between non-negative vectors. UOT differs from EOT in relaxing the hard constraint that P satisfy the given marginals η and θ to soft constraints that penalize marginals far from η or θ.²
In particular, as ϵ η and ϵ θ → ∞, we recover the EOT problem.
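Problem (1) can be solved with generalized Sinkhorn-style scaling iterations, a standard scheme for UOT in which the soft marginal constraints enter as exponents ϵ_η/(ϵ_η + ϵ_P) and ϵ_θ/(ϵ_θ + ϵ_P) that damp the usual Sinkhorn updates. A sketch (the toy cost matrix and marginals are illustrative assumptions):

```python
import numpy as np

def uot_sinkhorn(C, eta, theta, eps_P, eps_eta, eps_theta, n_iters=500):
    """Generalized Sinkhorn scaling for the UOT problem in Eq. (1).

    The exponents eps/(eps + eps_P) soften the marginal constraints;
    as eps_eta, eps_theta grow they approach 1 and the updates reduce
    to ordinary Sinkhorn iterations for EOT.
    """
    K = np.exp(-C / eps_P)                 # Gibbs kernel
    u = np.ones_like(eta)
    v = np.ones_like(theta)
    f_eta = eps_eta / (eps_eta + eps_P)
    f_theta = eps_theta / (eps_theta + eps_P)
    for _ in range(n_iters):
        u = (eta / (K @ v)) ** f_eta       # damped row scaling
        v = (theta / (K.T @ u)) ** f_theta # damped column scaling
    return u[:, None] * K * v[None, :]     # P = diag(u) K diag(v)

# Toy instance: n = 3 data, m = 2 hypotheses.
C = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [0.5, 0.5]])
eta = np.array([0.5, 0.3, 0.2])
theta = np.array([0.6, 0.4])

# Large eps_eta, eps_theta: the plan's marginals approximately match
# eta and theta, recovering the EOT regime.
P = uot_sinkhorn(C, eta, theta, eps_P=0.1, eps_eta=1e3, eps_theta=1e3)
```

With ϵ_η = ϵ_θ = 10³, the row sums of P are close to η and the column sums close to θ; shrinking either parameter lets the corresponding marginal drift away from its target.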



¹ To guarantee that the hypotheses are distinguishable, we assume that C does not contain two columns that differ only by an additive scalar.
² UOT also generalizes to measures of arbitrary mass, i.e., the total mass of η need not equal that of θ.

